This paper presents a model selection technique of estimation in semiparametric regression models of the type $Y_i = \beta' X_i + f(T_i) + W_i$, $i = 1, \ldots, n$. The parametric and nonparametric components are estimated simultaneously by this procedure. Estimation is based on a collection of finite-dimensional models, using a penalized least squares criterion for selection. We show that by tailoring the penalty terms developed for nonparametric regression to semiparametric models, we can consistently estimate the subset of nonzero coefficients of the linear part. Moreover, the selected estimator of the linear component is asymptotically normal.
1. Introduction. The partially linear regression model was introduced by Engle, Granger, Rice and Weiss (1986) for the study of the relationship between weather and electricity sales and has received considerable attention over the last decade. Given $n$ i.i.d. observations $((X_1, T_1), Y_1), \ldots, ((X_n, T_n), Y_n)$, the model is

(1.1) $Y_i = \beta' X_i + f(T_i) + W_i = s(X_i, T_i) + W_i$,

where $W_1, \ldots, W_n$ are independent, identically distributed and zero mean error variables, assumed to be independent of $(X, T) \in \mathbb{R}^q \times \mathbb{R}$. We assume that $f \in \mathcal{F}_\alpha$, where $\mathcal{F}_\alpha$ is a class of smooth functions with degree of smoothness $\alpha$. The appeal of the model lies in its flexibility. It can be used when a simple linear regression is adequate, apart from a covariate, usually a confounder, that is known to affect the response in a nonlinear fashion. More generally, (1.1) can be regarded as a particular case of a multiple index model and can serve as a first step in a dimension reduction process.
F. BUNEA

Two important aspects of optimality in estimating (1.1) are:

O1. Parsimonious selection of the $X = \{X_1, \ldots, X_q\}$ covariates that avoids both underfitting and overfitting.

O2. Asymptotic normality of the selected estimator of $\beta$.

Let $I_0 \subseteq \{1, \ldots, q\}$ be the index set of the nonzero components of $\beta$. Let $\hat I$ be an estimator of $I_0$. We interpret $P(\hat I = I_0) \to 1$ as O1. The asymptotic normality of $\hat\beta_{\hat I}$, the estimator of $\beta$ corresponding to the selected index set $\hat I$, is our O2.

Model (1.1) was originally studied under the assumption that $I_0 = \{1, \ldots, q\}$ and $\alpha$ are known. We shall refer to this situation as Case 0. In this case, one can construct estimators of $f$ and $\beta$ using the knowledge of $\alpha$ and $I_0$, so there is no need for a model selection procedure. Wahba (1984), Green, Jennison and Seheult (1985) and Heckman (1986) suggested a least squares approach combined with spline smoothing for the nonlinear component; Chen (1988) proposed simultaneous least squares estimation of the parametric and nonparametric parts. The subsequent literature is vast, and we refer to Bickel, Klaassen, Ritov and Wellner (1993) and to the monograph by Härdle, Liang and Gao (2000) for an extensive bibliography. All suggested methods lead to asymptotically normal estimators of $\beta$, provided that the estimators of $f$ satisfy the minimum requirement

(1.2) $\|\hat f - f\|_{\mu_T} = o_P(n^{-1/4})$,

for $\alpha > 1/2$, where $\|\cdot\|_{\mu_T}$ is the $L_2(\mu_T)$ norm and $\mu_T$ is the probability distribution of $T$; see, for instance, Lemma 11.2, page 202, in van der Geer (2000), or Chen (1988). Throughout this paper we shall use (1.2) as a prerequisite for O2 and assess it over functions in $\mathrm{Lip}^*(\alpha, L_2(\mu_T))$, with $\alpha > 1/2$, defined in Section 2.1. We will compare the rate in (1.2) with the minimax rate in Corollary 3.1.

In this paper we use model selection based on penalized least squares 
minimization as an estimation procedure. We construct a sequence of finite- 
dimensional approximating spaces for s and find the least squares estimator 
corresponding to each space. We compute the residual sum of squares corre- 
sponding to each estimator and then add a penalty term. Our final estimator 
is the one with the smallest penalized residual sum of squares. This yields 
estimators for $\beta$, $I_0$ and $f$ simultaneously.

The aim of this paper is to study the performance of such estimators when we relax the conditions of Case 0. We consider the following cases:

• Case 2: $I_0$ unknown and $\alpha$ known, $\alpha > 1/2$.

• Case 3: $I_0$ and $\alpha$ unknown, $\alpha > 1/2$.

We show in Sections 3.2 and 4 that O1 (the consistency of the selected index) and O2 (the asymptotic normality of the selected estimator) hold in both cases. We note that in Case 3 we need O1 and O2 to hold uniformly over the range of $\alpha$. Notice further that the smoothness classes are nested, in the sense that $\alpha_1 > \alpha_2$ implies $\mathrm{Lip}^*(\alpha_1, L_2(\mu_T)) \subset \mathrm{Lip}^*(\alpha_2, L_2(\mu_T))$. Then, to establish the uniformity result it is enough to show that O1 and O2 hold for the smallest allowable $\alpha > 1/2$. This is the approach we adopt in the sequel.

We point out that our strategy of showing O2, in either case, involves two steps. The first one is to show that $\hat I$ stabilizes asymptotically, that is, that O1 holds. The second one requires that the estimator of $\beta$ of fixed dimension $|I_0|$ be asymptotically normal. This could be done by quoting directly the result obtained in Case 0 by Chen (1988), whose method of estimation is the closest to ours. However, the result of Chen (1988) holds over $m$ times continuously differentiable functions $f$ (cf. his Condition 2, page 138), so it is not directly applicable to our context, which covers nondifferentiable functions. We therefore address this situation here, and in fact we prove that the asymptotic normality of $\hat\beta$ of fixed dimension holds in the more general case:

• Case 1: $I_0$ known and $\alpha$ unknown, $\alpha > 1/2$.

O2 was not studied in any of these three cases.

Partial results on O1 were obtained in Case 3. Härdle, Liang and Gao (2000), for a time series version of model (1.1), used a kernel method to estimate $f$ and cross validation to select $I_0$ and the bandwidth simultaneously. They show O1 in Theorem 6.3.1, page 137, but the construction of their estimator depends on the unknown $I_0$, as their Assumption 6.6.7, page 158, imposes lower and upper bound restrictions on the bandwidth that depend on $I_0$.

Chen and Chen (1991) used B-splines to approximate the nonlinear component and an Akaike type technique for simultaneous estimation of $I_0$ and $f$. They discuss O1 in their Proposition 2, page 334, under a condition on a random criterion that depends on the unknown $I_0$ and $f$. Remark 3 verifies this condition only for $|I_0| = 1$ and under their Condition 4, page 326, which entails the existence of a lower bound on $f$; the further study of O1 is left as an open problem. We do not use any of these conditions here.

The remainder of the paper is organized as follows. Section 2.1 contains 
the assumptions under which our results hold. In Section 2.2 we give the 
construction of our estimators. In Section 2.3 we derive upper-bound oracle 
inequalities for the risk of our estimators, and obtain rates of convergence 
as a consequence. Sections 3 and 4 are central to our paper. In Section 3.1 
we discuss penalty choices and their impact on the rates of convergence 
derived in Section 2.3. We prove O1 in Section 3.2. In Section 4 we show
that the estimators of $\beta$ are asymptotically normal for the three cases under
consideration, therefore establishing O2. Section 5 provides conclusions. The
proofs of intermediate results are given in the Appendix. 
2. A penalized least squares estimator. In this section we devise estimators for $\beta$ and $f$ and establish their consistency.

2.1. Preliminaries. We begin by giving a list of assumptions under which the results of this paper hold.

Let $\mu$ be the joint probability distribution of $X$ and $T$. Let $\mu_X$ and $\mu_T$ denote the probability distributions of $X$ and $T$, respectively. Let $\mathrm{Lip}^*(\alpha, L_2(\mu_T))$ denote a generalized Lipschitz space, for some smoothness parameter $\alpha > 0$, and let $|\cdot|_{\alpha,2}$ be the seminorm in this space [see, e.g., DeVore and Lorentz (1993), page 51, for definition and properties]. For some positive constant $L > 0$ define $\mathcal{D}_{\alpha,2}(L) = \{g \in \mathrm{Lip}^*(\alpha, L_2(\mu_T)),\ |g|_{\alpha,2} \le L\}$.

Assumption 2.1. The support of $\mu$ is $[0,1] \times \mathcal{K}$, for some compact set $\mathcal{K} \subset \mathbb{R}^q$, and on its support $\mu$ admits a density with respect to the Lebesgue measure $\lambda$ that is bounded from below by $h_0 > 0$ and from above by $h_1 < \infty$. In addition, $\mu_T([0,1]) = 1$ and $\mu_X(\mathcal{K}) = 1$.

Assumption 2.2. There exists $d > 0$ such that $\tau_p = E(|W_1|^p) < \infty$ for

Assumption 2.3. For any $l \in \mathbb{R}^q \setminus \{0\}$, $\mathrm{Var}(l'X \mid T = t) > 0$ for all $t \in [0,1]$.

Assumption 2.4. $s \in L_2(\mathcal{K} \times [0,1], \lambda)$.

Assumption 2.5. $\theta_j = E(X_j \mid T = t) \in \mathcal{D}_{\gamma,2}(L)$ for some $\gamma > b/4$ and some fixed constant $b > 3$, for all $j \in \{1, \ldots, q\}$.

Note that Assumption 2.3 ensures that model (1.1) is identifiable; see, for example, Lemma 11.2 of van der Geer (2000) or Lemma 3 of Chen (1988).

Assumption 2.5 is a sufficient condition on the smoothness of $\theta_j$, $j \in \{1, \ldots, q\}$, which ensures that $\beta$ can be estimated at the optimal $n^{-1/2}$ convergence rate. See also Chen (1988), Heckman (1986), Speckman (1988) and van der Geer (2000) for other types of smoothness conditions, typically requiring the existence of a prespecified number of derivatives for $\theta_j$.

2.2. The sieves and the estimators. We construct now a sequence of approximating spaces for $s$ in (1.1). This mimics the construction of approximating spaces for generalized additive models, as in Barron, Birgé and Massart (1999) or Baraud (2000). Let $I = \{i_1, \ldots, i_l\} \in \mathcal{I}$, where $\mathcal{I} = \mathcal{P}(\{1, \ldots, q\})$, and $\mathcal{P}(F)$ denotes all subsets of a set $F$. Denote by $[\cdot]$ the integer part and by $\log_2$ the logarithm in base 2. Let $b$ be a fixed constant, $b > 3$. Define

(2.1) $A_n = [\log_2 (n/\log n)^{1/b}]$, $J_n = [\log_2 (n/\log n)^{1/2}]$, $B_n = 2^{A_n}$, $N_n = 2^{J_n}$.

For $k_n \in \{A_n, A_n + 1, \ldots, J_n\}$ let $K_n = 2^{k_n}$. For each $K_n \in \{B_n, \ldots, N_n\} = \mathcal{K}_n$ let $S_{K_n}$ be the linear space of piecewise polynomials of degree at most $r - 1$, based on a regular dyadic partition of size $1/K_n$. Thus, $S_{K_n}$ is the space of functions $v$ on $[0,1]$ of the form $v(t) = \sum_{j=1}^{K_n} P_j(t)\, 1(\{(j-1)/K_n \le t < j/K_n\})$, where each $P_j$ is a polynomial of degree at most $r - 1$ and $1(V)$ denotes the indicator of a set $V$. Note that $\dim(S_{K_n}) = rK_n$. Denoting the restriction of $\lambda$ to $[0,1]$ by $\lambda_T$, let $\{\phi_j\}_{j=1}^{rK_n}$ be an orthonormal basis in $L_2(\lambda_T)$ for $S_{K_n}$. For $(I, K_n) \in \mathcal{I} \times \mathcal{K}_n$ define

$S_{I,K_n} = \langle x_{i_1}, \ldots, x_{i_l}, \phi_1(t), \ldots, \phi_{rK_n}(t) \rangle$,

where $\langle \cdot \rangle$ denotes the linear span. Note that $|\mathcal{I} \times \mathcal{K}_n| = 2^q \times (J_n - A_n + 1)$, where $|\cdot|$ denotes the cardinality of a set. Notice that $\dim(S_{I,K_n}) = |I| + rK_n$.
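
The size of the model collection can be made concrete with a short computation. The following is a minimal sketch, assuming the reading of (2.1) with exponents $1/b$ and $1/2$ and the natural logarithm for $\log n$; the function names `dyadic_grid` and `model_count` are hypothetical and are not part of the paper.

```python
import math


def dyadic_grid(n, b):
    """Return the dyadic dimensions K_n = 2**k for k = A_n, ..., J_n.

    Assumes A_n = [log2((n / log n)**(1/b))] and
    J_n = [log2((n / log n)**(1/2))], with [.] the integer part.
    """
    base = n / math.log(n)
    a_n = math.floor(math.log2(base ** (1.0 / b)))
    j_n = math.floor(math.log2(base ** 0.5))
    return [2 ** k for k in range(a_n, j_n + 1)]


def model_count(n, b, q):
    """Cardinality of I x K_n: all subsets of q covariates times the grid size."""
    return (2 ** q) * len(dyadic_grid(n, b))


if __name__ == "__main__":
    print(dyadic_grid(1000, 4.0))
    print(model_count(1000, 4.0, 3))
```

The collection grows only logarithmically in $n$ through the grid of $K_n$ values, while the factor $2^q$ is fixed.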

We recall that the approximating space $S_{K_n}$ is known to have good approximation properties for a range of smoothness classes to which $f$ and $\theta_j$, $j = 1, \ldots, q$, may belong; see, for example, DeVore and Lorentz (1993). However, other choices are possible: spaces generated by piecewise polynomials based on an irregular partition of $[0,1]$, wavelets or trigonometric polynomials; see Birgé and Massart (1998) for a detailed discussion.

Let $\mathrm{pen}(I, K_n)$ be a penalty term associated with $S_{I,K_n}$. We defer a detailed discussion on the penalty term to Section 3.1. For $(I, K_n) \in \mathcal{I} \times \mathcal{K}_n$ and $u \in S_{I,K_n}$ let $\gamma_n(u) = n^{-1} \sum_{i=1}^n [Y_i - u(X_i, T_i)]^2$.

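Definition 2.1. A penalized least squares estimator relative to the collection $\{S_{I,K_n}\}_{(I,K_n) \in \mathcal{I} \times \mathcal{K}_n}$ is any $\tilde s \in S_{\hat I, \hat K_n}$ such that

(2.2) $\gamma_n(\tilde s) + \mathrm{pen}(\hat I, \hat K_n) = \inf_{(I,K_n) \in \mathcal{I} \times \mathcal{K}_n} \inf_{u \in S_{I,K_n}} \{\gamma_n(u) + \mathrm{pen}(I, K_n)\}$.

Let $Y = (Y_1, \ldots, Y_n)'$. Denote by $X$ the $n \times q$ matrix with columns $(X_{1j}, \ldots, X_{nj})'$, $1 \le j \le q$, and by $X_I$ the $n \times |I|$ matrix obtained from $X$ by retaining the columns corresponding to the index set $I \subseteq \{1, \ldots, q\}$. Let $\beta_I$ be a vector in $\mathbb{R}^{|I|}$ and let $\delta_{K_n}$ be a vector in $\mathbb{R}^{rK_n}$. Let $Z_{K_n}$ be the $n \times rK_n$ matrix whose $i$th row is $\phi_1(T_i), \ldots, \phi_{rK_n}(T_i)$. Then, in matrix notation, our estimator achieves the infimum below:

(2.3) $\inf_{(I,K_n)} \inf_{\beta_I, \delta_{K_n}} \{\frac{1}{n}(Y - X_I\beta_I - Z_{K_n}\delta_{K_n})'(Y - X_I\beta_I - Z_{K_n}\delta_{K_n}) + \mathrm{pen}(I, K_n)\}$.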
Let $\hat K_n$ and $\hat I$ be the indices for which (2.3) is attained. Then, if the minimization problem has a unique solution, following Seber [(1977), Theorem 3.7], the least squares estimators of $\beta_{\hat I}$ and $\delta_{\hat K_n}$ are, respectively,

(2.4) $\tilde\beta_{\hat I} = (X_{\hat I}'(\mathrm{Id} - P_{\hat K_n})X_{\hat I})^{-1} X_{\hat I}'(\mathrm{Id} - P_{\hat K_n})Y$

and

(2.5) $\tilde\delta_{\hat K_n} = (Z_{\hat K_n}'Z_{\hat K_n})^{-1} Z_{\hat K_n}'(Y - X_{\hat I}\tilde\beta_{\hat I})$,

where $P_{\hat K_n}$ is the projection matrix on the space $L_{\hat K_n}$ generated by the columns of $Z_{\hat K_n}$. Thus, $L_{K_n} = \{(g(T_1), \ldots, g(T_n))' \mid g \in S_{K_n}\}$ and $P_{K_n} = Z_{K_n}(Z_{K_n}'Z_{K_n})^{-1}Z_{K_n}'$. Id denotes the $n \times n$ identity matrix.
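
For a fixed pair $(I, K_n)$, formulas (2.4) and (2.5) can be sketched numerically as follows. This is an illustration only, under the assumption $r = 1$ (piecewise constant functions on the regular partition); the function names are hypothetical, and the pseudo-inverse is used in place of $(Z'Z)^{-1}Z'$ as a guard against empty partition cells.

```python
import numpy as np


def design_piecewise_constant(t, K):
    """n x K matrix with entries sqrt(K) * 1{(j-1)/K <= t < j/K} (case r = 1).

    The columns are orthonormal in L2([0, 1]) with respect to Lebesgue measure.
    """
    t = np.asarray(t, dtype=float)
    cell = np.minimum((t * K).astype(int), K - 1)  # t = 1 goes to the last cell
    Z = np.zeros((t.size, K))
    Z[np.arange(t.size), cell] = np.sqrt(K)
    return Z


def partial_linear_fit(Y, X_I, Z):
    """Least squares fit of Y on [X_I, Z] via the formulas (2.4) and (2.5)."""
    Z_pinv = np.linalg.pinv(Z)
    P = Z @ Z_pinv                      # projection onto the column space of Z
    M = np.eye(len(Y)) - P              # Id - P
    beta = np.linalg.solve(X_I.T @ M @ X_I, X_I.T @ M @ Y)
    delta = Z_pinv @ (Y - X_I @ beta)
    return beta, delta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = rng.uniform(size=200)
    X = rng.normal(size=(200, 2))
    Y = X @ np.array([1.5, -0.7]) + np.sin(2 * np.pi * T) + 0.3 * rng.normal(size=200)
    print(partial_linear_fit(Y, X, design_piecewise_constant(T, 8)))
```

The linear coefficients are obtained after projecting out the nonparametric basis, which is the matrix counterpart of the simultaneous estimation described above.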

For any measure $\nu$ we denote by $\|\cdot\|_\nu$ the $L_2(\nu)$ norm. Let

$\Gamma_n = \{\|\tilde s\|_\lambda \le 2\exp(\log^2 n)\}$.

We note that, by Theorem 1.1 in Baraud (2002), $P(\Gamma_n^c) \to 0$. For technical reasons we consider as our final estimator of $s$,

$\hat s(x,t) = \tilde s(x,t)\, 1_{\Gamma_n}$.

We denote by $\tilde f$ the estimator of $f$ corresponding to $\tilde\delta_{\hat K_n}$. Hence, the estimators of the nonlinear and linear part are, respectively,

$\hat f = \tilde f\, 1_{\Gamma_n}$ and $\hat\beta = \tilde\beta\, 1_{\Gamma_n}$.

We mention here the approximating spaces used in the three cases under consideration. We elaborate on this in Section 3.1. For Case 1 we use $\{S_{I_0,K_n}\}_{K_n \in \mathcal{K}_n}$. In Case 2 we use $\{S_{I,K_{n,\alpha}}\}_{I \in \mathcal{I}}$ with $K_{n,\alpha} \asymp n^{1/(2\alpha+1)}$. Here and in the sequel the notation $a \asymp b$ means that $a$ is an integer power of 2 that differs from $b$ by at most a factor of 2. In Case 3 we use $\{S_{I,K_{n,a}}\}_{I \in \mathcal{I}}$ for $K_{n,a} \asymp n^{1/(2a+2)}$, with $a > 0$ arbitrarily close to zero. Notice that $K_{n,\alpha} \in \mathcal{K}_n$ if $1/2 < \alpha < (b-1)/2$. We shall need later the approximation theory result given by (2.7), which holds for $\alpha \in (0, r)$. This motivates the choice of $b > 3$ and the definition of $r = [(b-1)/2]$. Also, note that $K_{n,a} \asymp n^{1/(2a+2)} \in \mathcal{K}_n$ for any $0 < a < 1/2$.

We show in Appendix A.1 that, under Assumptions 2.1-2.5, the estimators are unique, except for a set of probability tending to zero. On this set we define our estimators to be identically zero.

2.3. The consistency of the penalized least squares estimators. In this section we give finite sample upper bounds on the risk of the estimator $\hat s$ of $s$. As an immediate consequence we then obtain rates of convergence for the estimators of $\beta$ and $f$, respectively. Oracle type inequalities for the risk of estimators obtained via model selection in nonparametric regression have been studied extensively over the last decade; see, for example, Barron, Birgé and Massart (1999), Baraud (2000, 2002), Wegkamp (2003) and the references therein. The results carry over to semiparametric regression and in this paper we adopt the approach of the second author.

Let $f_{K_n}$ be the $L_2(\mu_T)$ projection of $f$ onto $S_{K_n}$. Let $C_1, C_2 > 0$ denote dominating constants independent of $n$, given by Theorem 2.1 in Baraud (2002).

Theorem 2.1. Under Assumptions 2.1-2.5, for $\mathrm{pen}(I, K_n) \ge C_1(|I| + rK_n)/n$,

(2.6) $E\|\hat s - s\|_\mu^2 \le C_2 \inf_{K_n \in \mathcal{K}_n} \{\|f - f_{K_n}\|_{\mu_T}^2 + \mathrm{pen}(I_0, K_n) + \vartheta_n\}$,

with $\vartheta_n = 1/n + (\|s\|_\lambda^2 + 1)\exp(-2\log^2 n)$.

We note that $C_1$ depends on $\tau_2$ and that $C_2$ depends on $h_0, h_1, d, p, \tau_p$.

This theorem allows us to obtain rates of convergence for $\hat s$ by computing the infimum above, provided that $\|s\|_\lambda$ is bounded, in which case $\vartheta_n = O(n^{-1})$. This is guaranteed under Assumption 2.1 if $f \in \mathcal{D}_{\alpha,2}(L)$ and $\beta \in \mathcal{K}_1$, for some compact set $\mathcal{K}_1 \subset \mathbb{R}^q$. Furthermore, the next corollary shows that the rate of convergence of $\hat s$ is inherited by the estimators of $\beta$ and $f$. Note that if $f \in \mathcal{D}_{\alpha,2}(L)$ for some $L > 0$, then by Theorem 2.4, page 358, in DeVore and Lorentz (1993) and Assumption 2.1, for any $\alpha \in (0, r)$ there exists $C(\alpha, L) > 0$ such that

(2.7) $\|f - f_{K_n}\|_{\mu_T}^2 \le h_1 C(\alpha, L) K_n^{-2\alpha}$.

Define

(2.8) $r_n = \inf_{K_n \in \mathcal{K}_n} \{h_1 C(\alpha, L) K_n^{-2\alpha} + \mathrm{pen}(I_0, K_n) + \vartheta_n\}$.

Let $|\cdot|_2$ denote the Euclidean norm. For a function $g$ of generic argument $Z$ we denote its empirical norm by $\|g\|_n^2 = n^{-1}\sum_{i=1}^n g^2(Z_i)$. In addition, by abuse of notation, we regard here $\hat\beta_{\hat I}$ as a vector in $\mathbb{R}^q$ by adding 0's to the necessary positions.

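Corollary 2.1. Under Assumptions 2.1-2.5, for $\mathrm{pen}(I, K_n) \ge C_1(|I| + rK_n)/n$, if $f \in \mathcal{D}_{\alpha,2}(L)$, $0 < \alpha < r$, we have:

1. $\|\hat f - f\|_{\mu_T}^2 = O_P(r_n)$, uniformly over $f \in \mathcal{D}_{\alpha,2}(L)$ and $\beta \in \mathcal{K}_1$.

2. $\|\hat f - f\|_n^2 = O_P(r_n)$, uniformly over $f \in \mathcal{D}_{\alpha,2}(L)$ and $\beta \in \mathcal{K}_1$.

3. $|\hat\beta_{\hat I} - \beta|_2^2 = O_P(r_n)$, uniformly in $\beta \in \mathcal{K}_1$.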

We present the proof of these two results in Appendix A.2. We discuss in the next section our penalty choices and the corresponding values of $r_n$. Although $\hat f$ may achieve the minimax optimal rate of convergence, Corollary 2.1 gives suboptimal rates for the estimator of $\beta$. We show in Section 4 that we can achieve the expected $n^{-1/2}$ rate of convergence for $\hat\beta_{\hat I}$, provided that $P(\hat I = I_0) \to 1$. In the next section we discuss penalty choices for which this holds.

3. Penalty choices and the consistency of the selected index. There has been a vast literature on the estimation of $I_0$ in the fully parametric context and we only mention here the seminal works of Mallows (1973), Akaike (1974), Schwarz (1978) and Shibata (1981). Typically, model selection procedures based on a penalized criterion use penalty terms that are proportional either to $|I|/n$, where $|I|$ is the dimension of a fitted model, or to $|I|\log n/n$. These give rise to Akaike-type (AIC) and Schwarz-type (BIC) methods, respectively. In AIC the selected model is expected to include about one superfluous parameter [Woodroofe (1982)]. BIC chooses the correct model with probability converging to 1 [Haughton (1988)]. See also Guyon and Yao (1999) for a recent survey.

In semiparametric models we cannot obtain the consistency of $\hat I$ by a simple extension of these methods, because a penalty term proportional to $(|I| + K_n)\log n/n$ no longer suffices. In the parametric case a penalty term essentially balances out the residual variance of competing models of different dimensions, and the bias term disappears for the models that include the true one. However, in the semiparametric case the penalty term is also required to balance out the bias introduced by approximating $f$ within a finite-dimensional space. Since in general $f$ does not belong to any of the approximating spaces, this bias is not zero, and a penalty that is proportional to the dimension of a fitted model is too small to achieve the correct balance.

3.1. Penalty choices and rates of convergence. In this section we give sufficient conditions on the penalty term for which the optimality criteria O1 and O2 hold.

We first discuss O1, which is $P(\hat I \ne I_0) \to 0$. Note that

(3.1) $P(\hat I \ne I_0) = P(I_0 \not\subset \hat I) + P(I_0 \subsetneq \hat I)$.

We showed in Corollary 2.1 that the estimators of $\beta$ are consistent, for any penalty term that satisfies $\mathrm{pen}(I, K_n) \ge (|I| + rK_n)\log n/n$. We will show in Theorem 3.1 that the first term in (3.1) converges to zero, for any penalty term that satisfies this restriction. However, we cannot use the consistency of $\hat\beta_{\hat I}$ to show that the second term converges to zero, as we can overestimate the model but still consistently estimate 0's. The study of the convergence to zero of the second term in (3.1) leads to the second set of restrictions on our penalty term, namely, $\mathrm{pen}(I, K_n) - \mathrm{pen}(I_0, K_n) \ge h_1 C(\alpha) L K_n^{-2\alpha}$. This condition means that the penalty term needs to be greater than the bias induced by approximating $f$. Intuitively, if the bias due to the nonparametric component is present, it acts as a confounder, and the true parametric dimension cannot be recovered. Formally, as in (3.21), this condition ensures that $I_0$ can be found asymptotically.

We show in Theorem 3.1 below that O1 holds if the two conditions below are satisfied simultaneously:

(3.2)
(i) $\mathrm{pen}(I, K_n) - \mathrm{pen}(I_0, K_n) \ge h_1 C(\alpha) L K_n^{-2\alpha}$,
(ii) $\mathrm{pen}(I, K_n) \ge (|I| + rK_n)\log n/n$.

We discuss now sufficient conditions on the penalty term under which O2 holds. Recall that (1.2) is a prerequisite for O2. With the notation of Section 2.3, (1.2) can be rewritten as

(3.3) $\sqrt{n}\, r_n \to 0$ for $r_n = \inf_{K_n \in \mathcal{K}_n} \{h_1 C(\alpha, L) K_n^{-2\alpha} + \mathrm{pen}(I_0, K_n) + \vartheta_n\}$.

We show in Theorems 3.1 and 4.2 that O1 and O2 are compatible if (3.2) and (3.3) hold simultaneously. Note now the apparent contradiction between (3.2)(i) and (3.3): the first one essentially requires that the penalty term dominate the bias for all $K_n$, whereas the oracle inequality of Theorem 2.1 tells us that the best rate of convergence of $\hat f$ is achieved for a particular $K_n$, namely, the one realizing the best bias-variance trade-off. Notice further that by simply taking $\mathrm{pen}(I, K_n) = (|I| + rK_n)\log n/n$, we would have (3.2)(i) satisfied uniformly over $\alpha \in (1/2, r)$ only if, up to multiplicative constants, $K_n \ge n/\log n$ for all $K_n \in \mathcal{K}_n$, in which case the estimators would no longer be defined.

The first part of the solution is to construct a penalty term in which the dimensions $|I|$ and $rK_n$ are multiplied rather than added. Thus, we first consider

(3.4) $\mathrm{pen}(I, K_n) = 2(|I| + 1)rK_n\log n/n$.

This penalty satisfies (3.2)(ii) and we show in Corollary 3.1 that it also leads to an $r_n$ that satisfies (3.3).

However, if we use (3.4) for either Case 2 or Case 3, then (3.2)(i) holds only if, up to multiplicative constants, $K_n \ge (n/\log n)^{1/(2\alpha+1)}$ for all $K_n \in \mathcal{K}_n$.

If $\alpha > 1/2$ is known, then for $K_{n,\alpha} \asymp n^{1/(2\alpha+1)}$ and $n$ large enough,

(3.5) $\mathrm{pen}(I, K_{n,\alpha}) = 2(|I| + 1)rK_{n,\alpha}\log n/n$

satisfies (3.2) by construction, and (3.3) holds by Corollary 3.1, for this $\alpha$. We use the penalty term (3.5) in Case 2, for the approximating spaces $\{S_{I,K_{n,\alpha}}\}_{I \in \mathcal{I}}$.
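
The Case 2 procedure, with the penalty (3.5) and the fixed dimension $K_{n,\alpha}$, can be sketched as an exhaustive search over the subsets $I$. This is a minimal illustration rather than the implementation used in the paper: it assumes $r = 1$ (piecewise constant basis), takes $K_{n,\alpha}$ as the power of 2 nearest to $n^{1/(2\alpha+1)}$ on the log scale, and uses hypothetical function names.

```python
import itertools
import math

import numpy as np


def dyadic_dimension(n, alpha):
    """Integer power of 2 within a factor of 2 of n**(1/(2*alpha + 1))."""
    return 2 ** round(math.log2(n ** (1.0 / (2.0 * alpha + 1.0))))


def select_covariates(Y, X, T, alpha):
    """Minimise gamma_n + pen over subsets I, with pen as in (3.5) and r = 1."""
    n, q = X.shape
    K = dyadic_dimension(n, alpha)
    cell = np.minimum((np.asarray(T, dtype=float) * K).astype(int), K - 1)
    Z = np.zeros((n, K))
    Z[np.arange(n), cell] = 1.0  # indicator basis; scaling does not change the fit
    best_crit, best_I = None, None
    for size in range(q + 1):
        for I in itertools.combinations(range(q), size):
            D = np.hstack([X[:, np.array(I, dtype=int)], Z])
            coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
            gamma_n = float(np.mean((Y - D @ coef) ** 2))
            crit = gamma_n + 2 * (len(I) + 1) * K * math.log(n) / n
            if best_crit is None or crit < best_crit:
                best_crit, best_I = crit, I
    return best_I, K


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 400
    T = rng.uniform(size=n)
    X = rng.normal(size=(n, 3))
    Y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + np.sin(2 * np.pi * T) + 0.5 * rng.normal(size=n)
    print(select_covariates(Y, X, T, alpha=1.0))
```

Because the penalty multiplies $|I| + 1$ by $K_{n,\alpha}$, adding one superfluous covariate costs $2K_{n,\alpha}\log n/n$, which is far larger than the reduction in residual sum of squares that a noise variable typically brings.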
If $\alpha$ is not known, as in Case 3, we can no longer define a penalty term that depends on $\alpha$. In this case we need to construct a penalty that satisfies the contradictory requirements (3.2)(i) and (3.3) for all $K_n$ and uniformly over $\alpha > 1/2$. Since they cannot hold simultaneously for all $K_n$, the strategy we suggest is to find the best $K_n$ for which they are met, uniformly over $\alpha > 1/2$. For this, first recall that the smoothness spaces $\mathrm{Lip}^*(\alpha, L_2(\mu_T))$ are nested: the smaller the $\alpha$, the larger the space and, also, the smaller the $\alpha$, the larger the bias term $1/K_n^{2\alpha}$. Thus, the largest bias that needs to be dominated by the penalty term will correspond to $\alpha = 1/2 + a$ for $a > 0$ and arbitrarily close to zero, since this is the worst allowable case of $\alpha$. This reduces the penalty choice to the one of Case 2, for this particular choice of $\alpha$; namely, we choose $K_{n,a} \asymp n^{1/(2a+2)}$ and

(3.6) $\mathrm{pen}(I, K_{n,a}) = 2(|I| + 1)rK_{n,a}\log n/n$.

The corresponding approximating spaces are in this case $\{S_{I,K_{n,a}}\}_{I \in \mathcal{I}}$. Then, uniformly over $\alpha$, (3.2) holds by construction, and (3.3) holds by Corollary 3.1.

We remark now that in Case 1, in which $I_0$ is considered known, (3.2)(i) is no longer required. Only (3.2)(ii), which also guarantees the consistency of the estimator of $\beta$, and (3.3) are needed. We shall therefore use the penalty term (3.4), in connection with the approximating spaces $\{S_{I_0,K_n}\}_{K_n \in \mathcal{K}_n}$, for this case.

We conclude this section with Corollary 3.1, which is an immediate consequence of Theorem 2.1. This result summarizes the rates of convergence corresponding to each penalty term, and shows that (1.2) holds in each case. We give the proof in Appendix A.2.

Corollary 3.1. Under Assumptions 2.1-2.5, if $f \in \mathcal{D}_{\alpha,2}(L)$, we have:

1. $\alpha$ unknown.

(a) If $\mathrm{pen}(I_0, K_n) = 2(|I_0| + 1)rK_n\log n/n$, then

$r_n = O((\log n/n)^{2\alpha/(2\alpha+1)})$,

for any $\alpha \in (1/2, r)$ and known $I_0$.

(b) If $\mathrm{pen}(I, K_{n,a}) = 2(|I| + 1)rK_{n,a}\log n/n$, then

$r_n = O(\log n/n^{(2a+1)/(2a+2)})$,

for any $\alpha \in [1/2 + a, r)$, with $a > 0$ arbitrarily close to zero.

2. $\alpha$ known.

If $\mathrm{pen}(I, K_{n,\alpha}) = 2(|I| + 1)rK_{n,\alpha}\log n/n$ and $\alpha \in (1/2, r)$, then

$r_n = O(\log n/n^{2\alpha/(2\alpha+1)})$.
Remark 3.1. Notice that in 1(a) of the above corollary $r_n$ is the minimax adaptive rate of convergence, up to a $\log n$ factor. Since $2\alpha/(2\alpha+1) > 1/2$ for any $\alpha > 1/2$, (1.2) holds uniformly over $\alpha \in (1/2, r)$.

For part 2, $r_n$ is the minimax rate, up to a $\log n$ factor, and (1.2) holds for the particular fixed $\alpha$.

For 1(b), although the rate is suboptimal relative to minimax, we have $(2a+1)/(2a+2) > 1/2$ for any $a > 0$, and so again (1.2) holds uniformly over $\alpha \in [1/2 + a, r)$.
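
The two exponent comparisons in this remark are elementary and can be checked numerically. The sketch below only evaluates the exponents of $n$ in the rates of Corollary 3.1; the function names are hypothetical.

```python
import math


def adaptive_exponent(alpha):
    """Exponent 2*alpha / (2*alpha + 1) of the rate in parts 1(a) and 2."""
    return 2.0 * alpha / (2.0 * alpha + 1.0)


def case3_exponent(a):
    """Exponent (2a + 1) / (2a + 2) of the rate in part 1(b)."""
    return (2.0 * a + 1.0) / (2.0 * a + 2.0)


if __name__ == "__main__":
    for alpha in (0.51, 0.75, 1.0, 2.0):
        # r_n = o(n**(-1/2)) requires an exponent strictly above 1/2
        print(alpha, adaptive_exponent(alpha), adaptive_exponent(alpha) > 0.5)
    for a in (0.01, 0.1, 0.4):
        print(a, case3_exponent(a), case3_exponent(a) > 0.5)
```

Both exponents approach 1/2 from above as $\alpha \to 1/2$ and $a \to 0$, which is why the requirement $\alpha > 1/2$ cannot be relaxed in this argument.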

Remark 3.2. We note that the rate of convergence in 1(a) is achieved for $K_n^* \asymp (n/\log n)^{1/(2\alpha+1)}$. This dimension belongs to our $\mathcal{K}_n$ for all $1/2 < \alpha < r = [(b-1)/2]$.

3.2. The consistency of the selected index.

Theorem 3.1. If $f \in \mathcal{D}_{\alpha,2}(L)$ and Assumptions 2.1-2.5 hold, then:

(a) $P(\hat I \ne I_0) \to 0$ as $n \to \infty$, for the penalty term given by (3.5) and for some given $\alpha \in (1/2, r)$.

(b) $P(\hat I \ne I_0) \to 0$ as $n \to \infty$, for the penalty term (3.6), uniformly over $\alpha \in [1/2 + a, r)$ for some $a > 0$, arbitrarily small.

Proof. Notice that

(3.7) $P(\hat I \ne I_0) = P(I_0 \subsetneq \hat I) + P(I_0 \not\subset \hat I)$.

We show that each term in (3.7) converges to zero.

1. $P(I_0 \subsetneq \hat I) \to 0$ as $n \to \infty$.

(a) Notice that if $I_0 = \{1, \ldots, q\}$, $P(I_0 \subsetneq \hat I) = 0$, so it is enough to consider $I_0 \subsetneq \{1, \ldots, q\}$. Then

(3.8) $\lim_{n \to \infty} P(I_0 \subsetneq \hat I) = \lim_{n \to \infty} \sum_{I \supsetneq I_0} P(\hat I = I)$.

For $I \supsetneq I_0$ and re-denoting $K_n = K_{n,\alpha} \asymp n^{1/(2\alpha+1)}$, define

(3.9) $f_n(I, K_n) = \inf_{v \in S_{I,K_n}} \{\gamma_n(v) + \mathrm{pen}(I, K_n)\}$.

Recall that $f_{K_n}$ is the $L_2(\mu_T)$ projection of $f$ onto $S_{K_n}$. Define $s_n(x, t) = \beta_{I_0}'x + f_{K_n}(t)$, with $\beta_{I_0}$ regarded as a vector in $\mathbb{R}^q$, by adding zeros to the necessary positions. Notice that $s_n \in S_{I_0,K_n}$ and so $s_n \in S_{I,K_n}$ for all $I \supseteq I_0$. Also, by (3.9) note that

$f_n(I_0, K_n) \le \gamma_n(s_n) + \mathrm{pen}(I_0, K_n)$.
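Then, by the definition of the estimator, we have

(3.10)
$P(\hat I = I) = P(f_n(I, K_n) - f_n(I', K_n) \le 0, \text{ for all } I' \ne I)$
$\le P(f_n(I, K_n) - f_n(I_0, K_n) \le 0)$
$\le P\big(\sup_{v \in S_{I,K_n}} (\gamma_n(s_n) - \gamma_n(v)) \ge \mathrm{pen}(I, K_n) - \mathrm{pen}(I_0, K_n)\big)$.

For any function $g$ of generic argument $Z$ we denote $\mathbf{g} = (g(Z_1), \ldots, g(Z_n))'$. For any $U = (U_1, \ldots, U_n)'$ we let $\|U\|_n^2 = \frac{1}{n}\sum_{i=1}^n U_i^2$, $\langle U, g \rangle_n = \frac{1}{n}\sum_{i=1}^n U_i g(Z_i)$. Notice then that, by the definition of $\gamma_n$ and $s_n$, we have

$\gamma_n(s_n) - \gamma_n(v) = 2\langle W, f - f_{K_n} \rangle_n - \|f - f_{K_n}\|_n^2 + 2\langle W, v - s \rangle_n - \|s - v\|_n^2 + 2\|f - f_{K_n}\|_n^2$.

Let $a_n = \mathrm{pen}(I, K_n) - \mathrm{pen}(I_0, K_n)$. Then, by (3.10) we obtain

(3.11)
$P(\hat I = I) \le P\big(\sup_{v \in S_{I,K_n}} (2\langle W, v - s \rangle_n - \|s - v\|_n^2) \ge a_n/3\big)$
$\quad + P(2\langle W, f - f_{K_n} \rangle_n - \|f - f_{K_n}\|_n^2 \ge a_n/3)$
$\quad + P(\|f - f_{K_n}\|_n^2 \ge a_n/6)$.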
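
The decomposition of $\gamma_n(s_n) - \gamma_n(v)$ used to obtain (3.11) is a purely algebraic identity, valid for any vectors once $Y = s + W$. The following sketch checks it on arbitrary random vectors; the vector `d` stands in for $f - f_{K_n} = s - s_n$, and the function name is hypothetical.

```python
import numpy as np


def check_decomposition(n=50, seed=0):
    """Return both sides of the identity for gamma_n(s_n) - gamma_n(v)."""
    rng = np.random.default_rng(seed)
    s, s_n, v, W = (rng.normal(size=n) for _ in range(4))
    Y = s + W                                   # model (1.1) in vector form

    def gamma(u):
        return np.mean((Y - u) ** 2)            # empirical least squares criterion

    def inner(a, b):
        return np.mean(a * b)                   # empirical inner product <a, b>_n

    d = s - s_n                                 # plays the role of f - f_{K_n}
    lhs = gamma(s_n) - gamma(v)
    rhs = (2 * inner(W, d) - inner(d, d)
           + 2 * inner(W, v - s) - inner(s - v, s - v)
           + 2 * inner(d, d))
    return lhs, rhs


if __name__ == "__main__":
    print(check_decomposition())
```

Splitting $\|f - f_{K_n}\|_n^2$ as $-\|\cdot\|_n^2 + 2\|\cdot\|_n^2$ is what produces two terms of the same form, each handled by the same projection argument, plus a pure bias term.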
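Recall that $P_{K_n}$ is the projection matrix onto

$L_{K_n} = \{(g(T_1), \ldots, g(T_n))' \mid g \in S_{K_n}\} \subset \mathbb{R}^n$.

Define by $P_{I,K_n}$ the projection matrix onto

$L_{I,K_n} = \{(h(X_1, T_1), \ldots, h(X_n, T_n))' \mid h \in S_{I,K_n}\} \subset \mathbb{R}^n$.

Then

$2\langle W, v - s \rangle_n - \|s - v\|_n^2$
$= 2\|P_{I,K_n}s - v\|_n \left\langle W, \frac{v - P_{I,K_n}s}{\|P_{I,K_n}s - v\|_n} \right\rangle_n - \|P_{I,K_n}s - v\|_n^2$
$\quad + 2\|P_{I,K_n}s - s\|_n \left\langle W, \frac{P_{I,K_n}s - s}{\|P_{I,K_n}s - s\|_n} \right\rangle_n - \|P_{I,K_n}s - s\|_n^2$
$\le \left\langle W, \frac{v - P_{I,K_n}s}{\|P_{I,K_n}s - v\|_n} \right\rangle_n^2 + \left\langle W, \frac{P_{I,K_n}s - s}{\|P_{I,K_n}s - s\|_n} \right\rangle_n^2$
$= \left\langle P_{I,K_n}W, \frac{v - P_{I,K_n}s}{\|P_{I,K_n}s - v\|_n} \right\rangle_n^2 + \left\langle W, \frac{P_{I,K_n}s - s}{\|P_{I,K_n}s - s\|_n} \right\rangle_n^2$
$\le \|P_{I,K_n}W\|_n^2 + \left\langle W, \frac{P_{I,K_n}s - s}{\|P_{I,K_n}s - s\|_n} \right\rangle_n^2$,

using $2|xy| \le x^2 + y^2$ for the first inequality and Cauchy-Schwarz for the last one. Using an identical reasoning, we also obtain that

$2\langle W, f - f_{K_n} \rangle_n - \|f - f_{K_n}\|_n^2 \le \|P_{K_n}W\|_n^2 + \left\langle W, \frac{P_{K_n}f - f}{\|P_{K_n}f - f\|_n} \right\rangle_n^2$.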
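All ratios introduced above are defined to be zero if the denominator is zero, in which case the first two probabilities in (3.11) are identically zero. Then, for the first two terms in (3.11),

(3.12)
$P\big(\sup_{v \in S_{I,K_n}} (2\langle W, v - s \rangle_n - \|s - v\|_n^2) \ge \frac{a_n}{3}\big)$
$\le P\big(\|P_{I,K_n}W\|_n^2 \ge \frac{a_n}{6}\big) + P\left(\left\langle W, \frac{P_{I,K_n}s - s}{\|P_{I,K_n}s - s\|_n} \right\rangle_n^2 \ge \frac{a_n}{6}\right)$.

Similarly,

(3.13)
$P\big(2\langle W, f - f_{K_n} \rangle_n - \|f - f_{K_n}\|_n^2 \ge \frac{a_n}{3}\big)$
$\le P\big(\|P_{K_n}W\|_n^2 \ge \frac{a_n}{6}\big) + P\left(\left\langle W, \frac{P_{K_n}f - f}{\|P_{K_n}f - f\|_n} \right\rangle_n^2 \ge \frac{a_n}{6}\right)$.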

We shall use Rosenthal's inequality, stated in Appendix A.4, to bound the second term in both (3.12) and (3.13). The application of this inequality, as in the proof of Theorem 3.1, page 484, of Baraud (2000), leads to

(3.14) $P\left(\left\langle W, \frac{P_{I,K_n}s - s}{\|P_{I,K_n}s - s\|_n} \right\rangle_n^2 \ge \frac{a_n}{6}\right) \le C(p)E|W_1|^p n^{-p/2}\left(\frac{a_n}{6}\right)^{-p/2}$

for any $p \ge 2$ and a constant $C(p)$ that depends only on $p$.

Recall that by the definition (3.5), $\mathrm{pen}(I, K_n) = 2(|I| + 1)rK_n\log n/n$, and so

(3.15) $a_n = \mathrm{pen}(I, K_n) - \mathrm{pen}(I_0, K_n) \ge 2K_n\log n/n$.

Recall that, by Assumption 2.2, we have $E|W_1|^p < \infty$ for any $p \ge d + 4$, for some $d > 0$. Then, by (3.14) and defining $B = 3^{p/2}C(p)E|W_1|^p$, we obtain

(3.16) $P\left(\left\langle W, \frac{P_{I,K_n}s - s}{\|P_{I,K_n}s - s\|_n} \right\rangle_n^2 \ge \frac{a_n}{6}\right) \le \frac{B}{(K_n\log n)^{p/2}} \to 0$,

and, using an identical argument,

(3.17) $P\left(\left\langle W, \frac{P_{K_n}f - f}{\|P_{K_n}f - f\|_n} \right\rangle_n^2 \ge \frac{a_n}{6}\right) \le \frac{B}{(K_n\log n)^{p/2}} \to 0$,

since $K_n \asymp n^{1/(2\alpha+1)} \to \infty$ for all $\alpha > 0$.

We will now invoke Corollary 5.1, page 478, in Baraud (2000) to bound the first term in either (3.12) or (3.13). We discuss first (3.13). Notice that $E\|P_{K_n}W\|_n^2 = \sigma^2\,\mathrm{tr}(P_{K_n})/n = \sigma^2 rK_n/n$, using the standard properties of the projection operator. Then, by Corollary 5.1 of Baraud (2000), for any $p \ge 2$ such that $E|W_1|^p < \infty$ and for any $m > 0$,

(3.18) $P(n\|P_{K_n}W\|_n^2 \ge r\sigma^2K_n + 2\sigma^2\sqrt{rK_n m} + \sigma^2 m) \le C'(p)\,\nu_p\,\frac{rK_n}{m^{p/2}}$,

for some constant $C'(p) > 0$ depending only on $p$ and for $\nu_p = E|W_1|^p/\sigma^p$ and $\sigma^2 = EW_1^2$.

We shall use (3.18) with $m = K_n^{3/4}\log n$. Recall now (3.15) and notice that, for $n$ large enough,

$na_n/6 \ge r\sigma^2K_n + 2\sigma^2\sqrt{rK_nK_n^{3/4}\log n} + \sigma^2K_n^{3/4}\log n$,

for all $K_n \in \mathcal{K}_n$. We mention that other choices of $m$ are possible, but at the price of additional technicalities and with very little gain in terms of the overall result. Then, with $B' = rC'(4)\nu_4$ and using (3.18), we obtain

(3.19) $P\left(n\|P_{K_n}W\|_n^2 \ge \frac{na_n}{6}\right) \le \frac{B'}{K_n^{1/2}\log^2 n} \to 0$,

since, by Assumption 2.2, $p \ge 4 + d > 8/3$ for any $d > 0$.

For (3.12) we first notice that $E\|P_{I,K_n}W\|_n^2 = \sigma^2(|I| + rK_n)/n$. Then, if we replace above $rK_n$ by $|I| + rK_n$, we also obtain

(3.20) $P\left(n\|P_{I,K_n}W\|_n^2 \ge \frac{na_n}{6}\right) \le \frac{B''}{K_n^{1/2}\log^2 n} \to 0$,

for an appropriately modified constant $B''$.

We bound now the last term in (3.11). Recall the approximation error bound of (2.7). By Markov's inequality and (3.15) we have, with $C = 3h_1C(\alpha, L)$,

(3.21)
$P\left(\|f - f_{K_n}\|_n^2 \ge \frac{a_n}{6}\right) \le \frac{6E\|f - f_{K_n}\|_n^2}{a_n} = \frac{6\|f - f_{K_n}\|_{\mu_T}^2}{a_n}$
$\le \frac{C}{K_n^{2\alpha}} \times \frac{n}{K_n\log n} \le \frac{C}{\log n} \to 0$,

by the choice of $K_n \asymp n^{1/(2\alpha+1)}$.

Notice now that the number of terms in (3.8) is bounded by $A_1$, where $A_1 > 0$, independent of $n$, is the number of models the linear part of which includes the $I_0$ variables. Then, by (3.8), (3.11), (3.16), (3.17) and (3.19)-(3.21), we obtain $P(I_0 \subsetneq \hat I) \to 0$, which is the desired result.
(b) The proof is almost identical for this case, in which we now re-denote $K_{n,a} \asymp n^{1/(2a+2)}$ by $K_n$ and use the penalty term (3.6) instead of (3.5). Note that (3.16), (3.17), (3.19) and (3.20) only require that $K_n \to \infty$, and thus they hold independently of $\alpha$ or $a$. The only difference is in (3.21), which now becomes

$P\left(\|f - f_{K_n}\|_n^2 \ge \frac{a_n}{6}\right) \le \frac{6E\|f - f_{K_n}\|_n^2}{a_n}$
$\le \frac{C}{K_n^{2\alpha}} \times \frac{n}{K_n\log n} \le \frac{C}{K_n^{2a+1}} \times \frac{n}{K_n\log n}$ (uniformly over $\alpha \ge 1/2 + a$)
$\le \frac{C}{\log n} \to 0$ for $K_{n,a} \asymp n^{1/(2a+2)}$.

Then the concluding argument is identical to the one above.

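2. $P(I_0 \not\subset \hat I) \to 0$. The proof is the same in both cases. Let $c = \inf_{j \in I_0} |\beta_j|$ and notice that, by the definition of $I_0$, we have $c > 0$. Consequently, $|\hat\beta_{j_n} - \beta_{j_n}| = |\beta_{j_n}|$, for all $j_n \in I_0 \setminus \hat I$, and

$P(I_0 \not\subset \hat I) \le P(j_n \notin \hat I \text{ for some } j_n \in I_0)$
$\le P(|\hat\beta_{j_n} - \beta_{j_n}| = |\beta_{j_n}|) \le P(|\hat\beta_{j_n} - \beta_{j_n}| \ge c) \to 0$,

by the component-wise consistency of $\hat\beta_{\hat I}$ implied by 3 of Corollary 2.1. This completes the proof of this theorem. □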

4. Asymptotic normality. 

4.1. Asymptotic normality of (3if^. In this section we assume that Iq C 
{1, . . . ,q} is known. Then, with the notation of Section 2.2, (1.1) becomes 

(4.1) Yi = PlXi,,i + f{T,) + Wi. 

Here Xj^ denotes the vector of covariates corresponding to the index set /q . 
Let |/o| =90 for some known qo^q- In order to emphasize the dimension of 
the parametric part, we re-denote /3/,j by Pq^ and its estimator by Pg^. We 
first consider estimators for Pq^ and / within the family of approximating 
spaces {Sig^Kn}KneK:n- Recall that this corresponds to our Case 1 defined 
in the Introduction. We show in the following theorem that estimating / 
adaptively preserves the asymptotic normality of Pq^ . 

Let Sgo = {akj)qoxqo, with akj = Coy{Xk,Xj) - Gov {0kiT), 6 j{T)), for 
(j, k) elox Iq, and let Var(iy) = a"^. 

Theorem 4.1. Let $\hat\beta_{q_0}$ be the estimator of $\beta_{q_0}$ based on the approximating spaces $\{S_{I_0,K_n}\}_{K_n \in \mathcal{K}_n}$. Under Assumptions 2.1-2.5, if $f \in \mathcal{D}_{\alpha,2}(L)$, then

$$\sqrt{n}(\hat\beta_{q_0} - \beta_{q_0}) \xrightarrow{d} N_{q_0}(0, \sigma^2 \Sigma_{q_0}^{-1}).$$

Theorem 4.1 holds uniformly over $\alpha \in (1/2, r)$ and $\gamma \in (\delta/4, r)$, $\delta > 3$. Corollary 4.1 is an immediate consequence of this theorem. Recall the definitions $K_{n,\alpha} \asymp n^{1/(2\alpha+1)}$ and $K_{n,a} \asymp n^{1/(2a+2)}$.

Corollary 4.1. Let $\hat\beta_{q_0}$ be the estimator of $\beta_{q_0}$ based on either $S_{I_0,K_{n,\alpha}}$ or $S_{I_0,K_{n,a}}$. Under Assumptions 2.1-2.5, if $f \in \mathcal{D}_{\alpha,2}(L)$, then

$$\sqrt{n}(\hat\beta_{q_0} - \beta_{q_0}) \xrightarrow{d} N_{q_0}(0, \sigma^2 \Sigma_{q_0}^{-1}).$$

Corollary 4.1 holds uniformly over $\gamma \in (\delta/4, r)$, $\delta > 3$. Also, it holds for any fixed $\alpha \in (1/2, r)$ if $S_{I_0,K_{n,\alpha}}$ is used, and uniformly over $\alpha \in (1/2 + a, r)$ for $S_{I_0,K_{n,a}}$. We note that the estimators $\hat\beta_{q_0}$ are different in each situation; we have used the same notation for brevity, since we are only interested in their limiting distribution. The proofs are given in Appendix A.3.

4.2. Cases 2 and 3: asymptotic normality of $\hat\beta_{\hat I}$. Throughout this section we regard $\hat\beta_{\hat I}$ and $\hat\beta_{I_0}$ as vectors in $\mathbb{R}^q$, by adding 0's in the necessary positions. Also, we re-denote $\beta \in \mathbb{R}^q$ by $\beta_{I_0} \in \mathbb{R}^q$ to emphasize that the only nonzero components of $\beta$ correspond to the index set $I_0$. Let $V^0 = \sigma^2 \Sigma_{q_0}^{-1}$. Let $V = (V_{ij})_{q \times q}$, where $V_{ij} = V^0_{ij}$ for $(i, j) \in I_0 \times I_0$ and zero otherwise.

Theorem 4.2. If $f \in \mathcal{D}_{\alpha,2}(L)$ and Assumptions 2.1-2.5 hold, then we have the following results:

Case 2. If $\mathrm{pen}(I, K_{n,\alpha}) = 2(|I| + 1) r K_{n,\alpha} \log n / n$, for some given $1/2 < \alpha < r$, then $\sqrt{n}(\hat\beta_{\hat I} - \beta_{I_0}) \xrightarrow{d} N_q(0, V)$.

Case 3. If $\mathrm{pen}(I, K_{n,a}) = 2(|I| + 1) r K_{n,a} \log n / n$, then $\sqrt{n}(\hat\beta_{\hat I} - \beta_{I_0}) \xrightarrow{d} N_q(0, V)$, uniformly over $\alpha \in [1/2 + a, r)$ for $0 < a < r - 1/2$.

In both cases, the limiting distribution has all its mass concentrated on the space generated by the $I_0$ covariates.
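
The zero-padding that defines $V$ can be made concrete with a small numerical sketch. The matrix `V0`, the index set `I0` (0-based here) and the helper name `embed_covariance` are illustrative assumptions, not quantities from the paper; the block only checks that $c'Vc = c_0'V^0c_0$ once the coordinates outside $I_0$ are dropped.

```python
import numpy as np

def embed_covariance(V0, I0, q):
    """Zero-pad the q0 x q0 matrix V0 into a q x q matrix supported on I0 x I0."""
    I0 = list(I0)
    V = np.zeros((q, q))
    V[np.ix_(I0, I0)] = V0
    return V

q = 4
I0 = [0, 2]                        # hypothetical index set of nonzero coefficients
V0 = np.array([[2.0, 0.5],
               [0.5, 1.0]])        # hypothetical value of sigma^2 * Sigma_{q0}^{-1}
V = embed_covariance(V0, I0, q)

c = np.array([1.0, -3.0, 2.0, 5.0])
c0 = c[I0]                         # delete the coordinates outside I0
lhs = c @ V @ c                    # c' V c
rhs = c0 @ V0 @ c0                 # c0' V0 c0
```

Because the rows and columns of `V` outside `I0` are zero, a normal law with covariance `V` puts no mass on those coordinates, which is the sense in which the limit is concentrated on the $I_0$ covariates.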

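Proof. We prove that, in both cases, for any $c \in \mathbb{R}^q$,

(4.2)  $c'\sqrt{n}(\hat\beta_{\hat I} - \beta_{I_0}) \xrightarrow{d} N(0, c'Vc)$ as $n \to \infty$,

which leads to the desired result. For any $b \in \mathbb{R}$, $c \in \mathbb{R}^q$ we have

(4.3)
$$P(c'(\sqrt{n}(\hat\beta_{\hat I} - \beta_{I_0})) < b)$$
$$= P(c'(\sqrt{n}(\hat\beta_{\hat I} - \beta_{I_0})) < b, \hat I = I_0) + P(c'(\sqrt{n}(\hat\beta_{\hat I} - \beta_{I_0})) < b, \hat I \ne I_0)$$
$$= P(c'(\sqrt{n}(\hat\beta_{I_0} - \beta_{I_0})) < b) - P(c'(\sqrt{n}(\hat\beta_{I_0} - \beta_{I_0})) < b, \hat I \ne I_0) + P(c'(\sqrt{n}(\hat\beta_{\hat I} - \beta_{I_0})) < b, \hat I \ne I_0).$$

Since, by definition, $\beta_{I_0} \in \mathbb{R}^q$ has nonzero elements only in the positions corresponding to $I_0$, then

(4.4)  $c'(\sqrt{n}(\hat\beta_{I_0} - \beta_{I_0})) = c_0'(\sqrt{n}(\hat\beta_{q_0} - \beta_{q_0}))$,

where $c_0$, $\hat\beta_{q_0}$ and $\beta_{q_0}$ are obtained from $c$, $\hat\beta_{I_0}$ and $\beta_{I_0}$, respectively, by deleting the coordinates corresponding to zeros in $\beta_{I_0}$. Now, we have that

$$P(c'(\sqrt{n}(\hat\beta_{\hat I} - \beta_{I_0})) < b, \hat I \ne I_0) \le P(\hat I \ne I_0)$$

and that

$$P(c'(\sqrt{n}(\hat\beta_{I_0} - \beta_{I_0})) < b, \hat I \ne I_0) \le P(\hat I \ne I_0).$$

By Theorem 3.1, $P(\hat I \ne I_0) \to 0$ in both cases. Thus, from (4.3) and (4.4) we obtain

(4.5)  $\lim_{n\to\infty} P(\sqrt{n}(c'(\hat\beta_{\hat I} - \beta_{I_0})) < b) = \lim_{n\to\infty} P(\sqrt{n}(c_0'(\hat\beta_{q_0} - \beta_{q_0})) < b)$.

In both cases, by Corollary 4.1, the right-hand side in (4.5) equals the $N(0, c_0'\sigma^2\Sigma_{q_0}^{-1}c_0)$ distribution function evaluated at $b$. Lemma A.5, in Appendix A.4, shows that the only symmetric and semipositive definite matrix $V$ such that

$$c_0'V^0c_0 = c'Vc \quad \text{for any } c \in \mathbb{R}^q,$$

is given by $V_{ij} = V^0_{ij}$ for $(i, j) \in I_0 \times I_0$ and zero otherwise. Then (4.2) holds and the proof of the theorem is complete. □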

We discuss below the limiting covariance matrix given by Theorems 4.1 and 4.2, under the assumption that the error distribution is Gaussian.

First note that the information bound for a regular estimator of $\beta_{q_0}$ in (4.1) is $\sigma^2\Sigma_{q_0}^{-1}$, for some known $q_0 \le q$; see, for example, Example 5, page 110, in Bickel, Klaassen, Ritov and Wellner (1993). Then, Theorem 4.1 shows that, for known $I_0$, $\hat\beta_{q_0}$ is asymptotically efficient.

However, if $I_0$ itself is regarded as a parameter, as in our Cases 2 and 3, the classical information bound theory no longer applies and we need to resort to other means to assess the performance of our estimators. We thus verify whether our method, which is not based on a priori knowledge of $I_0$, leads to estimators with the same limit behavior as those constructed knowing $I_0$. By Theorem 4.2 and the continuous mapping theorem, we obtain

(4.6)  $\sqrt{n}(\hat\beta_{q_0} - \beta_{q_0}) \xrightarrow{d} N_{q_0}(0, \sigma^2\Sigma_{q_0}^{-1})$

and $\sqrt{n}(\hat\beta_{q_1} - \beta_{q_1}) \xrightarrow{P} 0$, where $\beta_{q_1}$ denotes the vector of zero coefficients in $\beta$. Then, indeed, $\hat\beta_{q_0}$ achieves the information bound for $\beta_{q_0}$ in (4.1). Notice now that if $I_0$ is known prior to estimation, then one can set $\beta_{q_1}$ to zero, whereas our method may estimate $\beta_{q_1}$ by nonzero sequences. However, they converge to zero at an $n^{-1/2}$ rate.

For Cases 2 and 3 a much simpler method of estimation in (1.1) is to fit the model with all covariates included. This will reduce the computing time considerably, since we will only fit one model. We denote by $\bar\beta \in \mathbb{R}^q$ the corresponding estimator; note that this estimator is different in the two cases, but here we are only interested in its limiting distribution, so we use the same notation. This simpler procedure of estimation is also independent of $I_0$ and, as above, we study the performance of $\bar\beta$ by investigating its asymptotic properties under the assumption that $I_0$ is known. Using the same reasoning as in Theorem 4.1 and, with $\beta_{I_0}$ denoting, as above, a vector in $\mathbb{R}^q$ having nonzero components only in the $I_0$ positions, one can show that $\sqrt{n}(\bar\beta - \beta_{I_0}) \xrightarrow{d} N_q(0, \sigma^2\Sigma_q^{-1})$. Let $\mathbf{I} = \Sigma_q/\sigma^2$. Consider the partition of $\mathbf{I}$ in blocks $\mathbf{I}_{11}$, $\mathbf{I}_{12}$, $\mathbf{I}_{21}$, $\mathbf{I}_{22}$, where $\mathbf{I}_{11}$ and $\mathbf{I}_{22}$ have dimensions $q_0 \times q_0$ and $q_1 \times q_1$, respectively. Then, as in Bickel, Klaassen, Ritov and Wellner [(1993), page 28],

(4.7)  $\sqrt{n}(\bar\beta_{q_0} - \beta_{q_0}) \xrightarrow{d} N_{q_0}(0, \mathbf{I}_{11.2}^{-1})$

and $\sqrt{n}(\bar\beta_{q_1} - \beta_{q_1}) \xrightarrow{d} N_{q_1}(0, \mathbf{I}_{22.1}^{-1})$, where $\mathbf{I}_{11.2} = \mathbf{I}_{11} - \mathbf{I}_{12}\mathbf{I}_{22}^{-1}\mathbf{I}_{21}$ and $\mathbf{I}_{22.1} = \mathbf{I}_{22} - \mathbf{I}_{21}\mathbf{I}_{11}^{-1}\mathbf{I}_{12}$. Note that the limiting distribution in (4.7) coincides with the one in (4.6) only if $\mathbf{I}_{12} = 0$. Thus, although this procedure might be more appealing from a computational point of view, it leads to estimators with higher variance, if the true model corresponds to a proper subset of the full collection of the $X$ covariates.
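
The variance comparison between (4.6) and (4.7) can be checked numerically. The sketch below uses a made-up $3 \times 3$ matrix `Sigma_q` with $q_0 = 2$ and $\sigma^2 = 1$ (both are assumptions chosen only for illustration) and a hypothetical helper `limiting_variances`; it computes $\mathbf{I}_{11}^{-1}$ and $\mathbf{I}_{11.2}^{-1}$ and inspects the eigenvalues of their difference.

```python
import numpy as np

def limiting_variances(Sigma_q, sigma2, q0):
    """Return (I_11^{-1}, I_{11.2}^{-1}) for the information matrix I = Sigma_q / sigma2."""
    info = Sigma_q / sigma2
    I11, I12 = info[:q0, :q0], info[:q0, q0:]
    I21, I22 = info[q0:, :q0], info[q0:, q0:]
    var_selected = np.linalg.inv(I11)                  # limit covariance in (4.6)
    I11_2 = I11 - I12 @ np.linalg.inv(I22) @ I21       # Schur complement I_{11.2}
    var_full = np.linalg.inv(I11_2)                    # limit covariance in (4.7)
    return var_selected, var_full

# Hypothetical positive definite Sigma_q with a nonzero off-diagonal block I_12.
Sigma_q = np.array([[1.0, 0.3, 0.5],
                    [0.3, 1.0, 0.2],
                    [0.5, 0.2, 1.0]])
var_sel, var_full = limiting_variances(Sigma_q, sigma2=1.0, q0=2)
gap_eigs = np.linalg.eigvalsh(var_full - var_sel)      # eigenvalues of the excess variance

# With I_12 = 0 the two limits coincide.
var_sel0, var_full0 = limiting_variances(np.diag([1.0, 2.0, 3.0]), 1.0, 2)
```

In this example the difference `var_full - var_sel` is positive semidefinite and nonzero, which is the numerical counterpart of the statement that the full-model fit has higher asymptotic variance unless $\mathbf{I}_{12} = 0$.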

5. Conclusions. This article studies simultaneous estimation of $\beta$, $I_0$ and $f$. We showed that one can consistently estimate $I_0$ and that the selected estimator of $\beta$ is asymptotically normal. The construction of the approximating spaces used for estimation parallels the one used in parametric or nonparametric model selection problems, but the penalty term needs to be adjusted for the semiparametric case. We summarize our findings in each of the cases under consideration.

• Case 2: $I_0$ unknown and $\alpha > 1/2$ known. O1 and O2 hold. For the approximating spaces $\{S_{I,K_{n,\alpha}}\}_{I \in \mathcal{I}}$ and the penalty term $\mathrm{pen}(I, K_{n,\alpha}) = 2(|I| + 1) r K_{n,\alpha} \log n / n$, with $K_{n,\alpha} \asymp n^{1/(2\alpha+1)}$, we showed that $P(\hat I = I_0) \to 1$ and that $\sqrt{n}(\hat\beta_{\hat I} - \beta_{I_0}) \xrightarrow{d} N_q(0, V)$ for a specified $\alpha$. The rate of convergence $r_n$ of $\hat f$ is of order $O(\log n / n^{2\alpha/(2\alpha+1)})$, which is the minimax optimal rate for a given $\alpha$, up to a $\log n$ factor.

• Case 3: $I_0$ and $\alpha > 1/2$ unknown. O1 and O2 hold. For the approximating spaces $\{S_{I,K_{n,a}}\}_{I \in \mathcal{I}}$ with $K_{n,a} \asymp n^{1/(2a+2)}$, $a > 0$ arbitrarily close to zero, and for the penalty term $\mathrm{pen}(I, K_{n,a}) = 2(|I| + 1) r K_{n,a} \log n / n$, we showed that $P(\hat I = I_0) \to 1$ and that $\sqrt{n}(\hat\beta_{\hat I} - \beta_{I_0}) \xrightarrow{d} N_q(0, V)$ uniformly over $\alpha \in [1/2 + a, r)$. The rate $r_n$ of $\hat f$ is of order $O(\log n / n^{(2a+1)/(2a+2)})$.

We also note that in Case 1: $I_0$ known, $\alpha$ unknown, for the approximating spaces $\{S_{I_0,K_n}\}_{K_n \in \mathcal{K}_n}$ and the penalty term $\mathrm{pen}(I_0, K_n) = 2(|I_0| + 1) r K_n \log n / n$, we have shown that $\sqrt{n}(\hat\beta_{q_0} - \beta_{q_0}) \xrightarrow{d} N_{q_0}(0, \sigma^2\Sigma_{q_0}^{-1})$ for $q_0 = |I_0|$. Also, $r_n = O((\log n/n)^{2\alpha/(2\alpha+1)})$ for any $\alpha \in (1/2, r)$. Thus, $\hat f$ is minimax adaptive, up to a $\log n$ factor.
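
To illustrate Case 2, the following is a minimal sketch of covariate selection by penalized least squares. It rests on several assumptions that are not taken from the paper: the criterion is written as mean squared residual plus the penalty $2(|I|+1) r K \log n / n$ with the noise variance scaled to 1, the nonparametric part uses a piecewise-constant (histogram) basis so that $r = 1$, the data are simulated, and the names `histogram_design` and `select_covariates` are hypothetical.

```python
import itertools
import numpy as np

def histogram_design(t, K):
    """Indicator columns of K equal-width bins on [0, 1] (piecewise-constant basis, r = 1)."""
    bins = np.minimum((t * K).astype(int), K - 1)
    return np.eye(K)[bins]

def select_covariates(X, t, y, K, r=1):
    """Exhaustive search over subsets I, minimizing mean squared residual + penalty."""
    n, q = X.shape
    Z = histogram_design(t, K)
    best_crit, best_I = np.inf, ()
    for size in range(q + 1):
        for I in itertools.combinations(range(q), size):
            D = np.hstack([X[:, list(I)], Z])            # design for the space S_{I,K}
            coef = np.linalg.lstsq(D, y, rcond=None)[0]
            rss = np.mean((y - D @ coef) ** 2)
            pen = 2 * (len(I) + 1) * r * K * np.log(n) / n
            if rss + pen < best_crit:
                best_crit, best_I = rss + pen, I
    return best_I

# Simulated data (assumed design): bounded covariates that depend on T, true I0 = {0, 2}.
rng = np.random.default_rng(0)
n, alpha = 2000, 1.0
K = round(n ** (1 / (2 * alpha + 1)))                    # K_{n,alpha} ~ n^{1/(2 alpha + 1)}
t = rng.uniform(size=n)
X = rng.uniform(-1, 1, size=(n, 3)) + np.sin(2 * np.pi * t)[:, None]
beta = np.array([1.5, 0.0, -2.0])
y = X @ beta + np.cos(2 * np.pi * t) + rng.normal(size=n)
selected = select_covariates(X, t, y, K)
```

The exhaustive loop visits all $2^q$ subsets, so this sketch is only practical for small $q$; it is meant to show how the penalty trades off the fit against $|I|$, not to reproduce the paper's exact procedure.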

APPENDIX 

A.1. The uniqueness of the least squares estimators. In this section we give the proof of the asymptotic uniqueness of our estimators.

Lemma A.1. Under Assumptions 2.1 and 2.3, $Z_{K_n}'Z_{K_n}$ is invertible for any $K_n \in \mathcal{K}_n$, except for an event whose probability tends to zero as $n \to \infty$.

Proof. Notice that, for any $K_n \in \mathcal{K}_n$,

$$n^{-1}Z_{K_n}'Z_{K_n} = \Bigl(\frac{1}{n}\sum_{i=1}^n \phi_j(T_i)\phi_{j'}(T_i)\Bigr)_{1 \le j, j' \le rK_n}.$$

Then $Z_{K_n}'Z_{K_n}$ is invertible if and only if $\Phi = n^{-1}Z_{K_n}'Z_{K_n}$ is invertible. Let $\{\zeta_j\}_{j=1}^{rK_n}$ be the eigenvalues of $\Phi$, and let $(1/\zeta)_{\max} = \sup\{1/\zeta_1, \ldots, 1/\zeta_{rK_n}\}$. Thus, $\Phi$ is invertible on the set where $(1/\zeta)_{\max} \le \rho_0 < \infty$, for some $\rho_0 > 0$. By the proof of Lemma 3.1, page 492, in Baraud (2000), since $\{\phi_j\}_{j=1}^{rK_n}$ are orthonormal in $L_2(\lambda_T, [0, 1])$, we have

$$(1/\zeta)_{\max} = \sup_{u \in S_{K_n} \setminus \{0\}} \frac{\|u\|^2_{\lambda_T}}{\|u\|^2_n}.$$

Notice that Assumption 2.1 implies that the density of $T$ with respect to $\lambda_T$ is bounded above and below by $Lh_1$ and $Lh_0$, respectively, where $0 < L < \infty$ is the Lebesgue measure of the compact set $\mathcal{K}$. Also, under Assumptions 2.1 and 2.3, Proposition A.2, adapted to the case $q = 0$, ensures that for some constant $D_1 > 0$ condition ($H_{con}$): $\|u\|_\infty \le D_1\sqrt{rN_n}\,\|u\|_{\lambda_T}$, for all $u \in S_{N_n}$, of Baraud (2002) holds. Then, by Lemma 6.2, page 21, and the proof of Proposition 5.2, page 24, in Baraud (2002), for all $\rho_0 > L^{-1}h_0^{-1}$ and for a constant $D > 0$ depending on $L$, $h_0$, $h_1$ and $\rho_0$, we obtain

$$P\bigl((1/\zeta)_{\max} > \rho_0\bigr) = P\Bigl(\sup_{u \in S_{K_n} \setminus \{0\}} \frac{\|u\|^2_{\lambda_T}}{\|u\|^2_n} > \rho_0\Bigr) \le P\Bigl(\sup_{u \in S_{N_n} \setminus \{0\}} \frac{\|u\|^2_{\lambda_T}}{\|u\|^2_n} > \rho_0\Bigr) \to 0,$$

where the first inequality holds because the approximating spaces are nested and the convergence to zero holds since $N_n \asymp (n/\log n)^{1/2}$, by construction. This concludes the proof of this lemma. □

Remark A.1. Since $K_{n,\alpha} \asymp n^{1/(2\alpha+1)}$ and $K_{n,a} \asymp n^{1/(2a+2)}$ belong to $\mathcal{K}_n$, by construction (see page 903) the above result implies that $Z_{K_{n,\alpha}}'Z_{K_{n,\alpha}}$ and $Z_{K_{n,a}}'Z_{K_{n,a}}$ are also invertible.

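Lemma A.2. Under Assumptions 2.1-2.5, for any $I \subseteq \{1, \ldots, q\}$, $\mathbf{X}_I'(\mathrm{Id} - P_{\hat K_n})\mathbf{X}_I$ and $\mathbf{X}_I'(\mathrm{Id} - P_{K_{n,\alpha}})\mathbf{X}_I$, $\alpha \in (1/2, r)$, are invertible except for an event whose probability tends to zero as $n \to \infty$.

The proof of this lemma is based on Proposition A.1, which in turn requires the proof of Lemma A.3. We use the following notation here and in the sequel.

Notation A.1. Let $\Sigma_{|I|} = (\sigma_{kj})_{|I| \times |I|}$, with $\sigma_{kj} = \mathrm{Cov}(X_k, X_j) - \mathrm{Cov}(\theta_k(T), \theta_j(T))$, for $(j, k) \in I \times I$, $I \in \mathcal{I}$. Let $\boldsymbol{\theta}_j = (\theta_j(T_1), \ldots, \theta_j(T_n))'$, $\varepsilon_{ij} = X_{ij} - \theta_j(T_i)$, $\boldsymbol{\varepsilon}_j = (\varepsilon_{1j}, \ldots, \varepsilon_{nj})'$, for $j \in I$ and $1 \le i \le n$. Also, let $\boldsymbol{\theta}$ and $\boldsymbol{\varepsilon}$ be the $n \times |I|$ matrices having columns $\boldsymbol{\theta}_j$ and $\boldsymbol{\varepsilon}_j$, respectively. Let $I_q = \{1, \ldots, q\}$. Recall that for any $\mathbf{U} = (U_1, \ldots, U_n)'$ we denote $\|\mathbf{U}\|_n^2 = \frac{1}{n}\sum_{i=1}^n U_i^2$.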

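Lemma A.3. Under Assumptions 2.1-2.5,

(A.1)  $\|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j\|_n = O_P((\log n/n)^{1/4})$,

for any $j \in I_q$.

Proof. Let $b_n = (\log n/n)^{1/4}$. First notice that

(A.2)  $P(\|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j\|_n > b_n) \le \sum_{J=A_n}^{J_n} P(\|(\mathrm{Id} - P_{K_n})\boldsymbol{\theta}_j\|_n > b_n)$,

with $K_n = 2^J$. Next, observe that $P_{K_n}\boldsymbol{\theta}_j$ is the projection of $\boldsymbol{\theta}_j$ onto $L_{K_n} = \{(g(T_1), \ldots, g(T_n))'; g \in S_{K_n}\}$. For every $K_n \in \mathcal{K}_n$ let $\theta_{j,K_n}$ be the $L_2(\mu_T)$ projection of $\theta_j$ onto $S_{K_n}$. Then

$$\|(\mathrm{Id} - P_{K_n})\boldsymbol{\theta}_j\|_n \le \|\theta_j - \theta_{j,K_n}\|_n.$$

By Assumption 2.5, $\theta_j \in \mathcal{D}_{\gamma,2}(A)$. Thus, by (2.7), $\|\theta_j - \theta_{j,K_n}\|^2_{\mu_T} \le h_1 C(\gamma, A) \times K_n^{-2\gamma}$. Then, by Markov's inequality and recalling, by (2.1), that $A_n = [\log_2(n/\log n)^{1/\delta}]$, $\delta > 3$, and denoting $J_n = [\log_2(n/\log n)^{1/2}]$, we obtain

$$P(\|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j\|_n > b_n) \le \sum_{J=A_n}^{J_n} P(\|\theta_j - \theta_{j,K_n}\|_n > b_n) \le h_1 C(\gamma, A)\,\frac{n^{1/2}}{(\log n)^{1/2}} \sum_{J=A_n}^{J_n} \frac{1}{2^{2\gamma J}} \to 0$$

for $\gamma > \delta/4$, which concludes the proof of this result. □

Remark A.2. The proof of the above result also implies that (A.1) holds with $\hat K_n$ replaced by any $K_n \in \mathcal{K}_n$. Thus, in particular, it holds for $K_{n,\alpha} \asymp n^{1/(2\alpha+1)}$ and $K_{n,a} \asymp n^{1/(2a+2)}$.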

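Proposition A.1. Under Assumptions 2.1-2.5 we have

(A.3)  $\mathbf{X}_I'(\mathrm{Id} - P_{\hat K_n})\mathbf{X}_I/n \xrightarrow{P} \Sigma_{|I|}$.

Proof. With the Notation A.1, each row in $\mathbf{X}_I'$ can be written as $\boldsymbol{\theta}_j' + \boldsymbol{\varepsilon}_j'$. Then, for every $j, k \in I$, we have

(A.4)  $(\mathbf{X}_I'(\mathrm{Id} - P_{\hat K_n})\mathbf{X}_I)_{jk} = \boldsymbol{\varepsilon}_j'\boldsymbol{\varepsilon}_k - \boldsymbol{\varepsilon}_j'P_{\hat K_n}\boldsymbol{\varepsilon}_k + \boldsymbol{\varepsilon}_j'(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_k + \boldsymbol{\theta}_j'(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\varepsilon}_k + \boldsymbol{\theta}_j'(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_k$.

Notice that $\boldsymbol{\varepsilon}_j'\boldsymbol{\varepsilon}_k/n \to \sigma_{jk}$, by the law of large numbers and the definition of $\sigma_{jk}$. Also, as in the proof of Lemma 5 of Chen (1988),

(A.5)  $P(|n^{-1}(\boldsymbol{\varepsilon}_j'P_{\hat K_n}\boldsymbol{\varepsilon}_j)| > c) \le r c^{-1} n^{-1} E(\hat K_n) \to 0$

for any $c > 0$, since $E(\hat K_n) \le N_n \asymp (n/\log n)^{1/2}$. Notice now that, by the Cauchy-Schwarz inequality,

$$n^{-1}|\boldsymbol{\varepsilon}_j'(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_k| \le \|\boldsymbol{\varepsilon}_j\|_n \|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_k\|_n \xrightarrow{P} 0,$$

since $\|\boldsymbol{\varepsilon}_j\|_n \to \sigma_{jj}^{1/2}$ almost surely, which is finite, and $\|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_k\|_n \xrightarrow{P} 0$ by Lemma A.3. By symmetry, we also have $n^{-1}\boldsymbol{\theta}_j'(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\varepsilon}_k \xrightarrow{P} 0$. For the last term in (A.4), by the Cauchy-Schwarz inequality and by Lemma A.3, and since $\mathrm{Id} - P_{\hat K_n}$ is idempotent, we have

(A.6)  $n^{-1}|\boldsymbol{\theta}_j'(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_k| = n^{-1}|((\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j)'((\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_k)| \le \|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j\|_n \|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_k\|_n \xrightarrow{P} 0$. □

Remark A.3. By Remark A.2 and the proof above, we can conclude that (A.3) holds with $\hat K_n$ replaced by any $K_n \in \mathcal{K}_n$. Thus, as before, it holds for $K_{n,\alpha} \asymp n^{1/(2\alpha+1)}$ and $K_{n,a} \asymp n^{1/(2a+2)}$.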

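Proof of Lemma A.2. Let $P$ denote any of the two projection matrices. We show that, for any $c_I \in \mathbb{R}^{|I|} \setminus \{0\}$ the sequence $c_I'\mathbf{X}_I'(\mathrm{Id} - P)\mathbf{X}_I c_I$ tends in probability to $\infty$, which implies that the corresponding matrix $\mathbf{X}_I'(\mathrm{Id} - P)\mathbf{X}_I$ is positive definite, hence invertible, except for a set of probability tending to zero.

By Proposition A.1 and Remark A.3, we have that $\mathbf{X}_I'(\mathrm{Id} - P)\mathbf{X}_I/n \xrightarrow{P} \Sigma_{|I|}$ for any of the two projection matrices. Then, by the continuous mapping theorem, for any $c_I \in \mathbb{R}^{|I|} \setminus \{0\}$ we have that $c_I'\mathbf{X}_I'(\mathrm{Id} - P)\mathbf{X}_I c_I/n \xrightarrow{P} c_I'\Sigma_{|I|}c_I$. If we denote $\Sigma = \Sigma_{|I_q|}$, then, by Assumption 2.3, for any nonzero $l \in \mathbb{R}^q$,

$$l'\Sigma l = \mathrm{Var}(l'(X - E(X|T))) = \int_{[0,1]} \mathrm{Var}(l'X \,|\, t)\, d\mu_T(t) > 0.$$

Thus $\Sigma$ is positive definite, and so is $\Sigma_{|I|}$, for any $I \in \mathcal{I}$. Hence $c_I'\mathbf{X}_I'(\mathrm{Id} - P)\mathbf{X}_I c_I \xrightarrow{P} \infty$, which completes the proof of this lemma. □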

A.2. The consistency of the penalized least squares estimators and rates 
of convergence. We first establish Theorem 2.1. This theorem is a direct 
consequence of Theorem 2.1 of Baraud (2002). The next proposition verifies 
that its condition (Hcon) holds. 

Proposition A.2. If Assumptions 2.1 and 2.3 hold, then there exists a constant $K > 1$ such that, for any $g \in S_{I_q,N_n}$,

$$\|g\|_\infty \le K\sqrt{q + rN_n}\,\|g\|_\lambda,$$

where $\|g\|_\infty = \sup_{z \in \mathcal{K} \times [0,1]} |g(z)|$. Note that, by Assumption 2.1 and by construction, $g \in L_2(\mathcal{K} \times [0, 1], \lambda)$. Here we recall that $S_{I_q,N_n}$ is the largest of the approximating spaces introduced in Section 2.2, with $I_q = \{1, \ldots, q\}$ and $N_n$ given in (2.1).

Proof. Recall that $\{\phi_j\}_{j=1}^{rN_n}$ is an orthonormal basis in $L_2(\lambda_T, [0, 1])$ for the linear space $S_{N_n}$ defined in Section 2.2. Then we can write any $g \in S_{I_q,N_n}$ as $g(x, t) = \sum_{j=1}^q a_j x_j + \sum_{j=1}^{rN_n} b_j\phi_j(t)$. By Lemma 1, page 337, in Birgé and Massart (1998), we have that $\max_{t \in [0,1]} \sum_{j=1}^{rN_n} \phi_j^2(t) \le (2r - 1)^2 N_n$. Also, by Assumption 2.1, there exists an $M > 0$ such that $|X|_2 \le M$ with probability 1. Hence, applying the Cauchy-Schwarz inequality, we obtain

(A.7)
$$\|g\|_\infty^2 \le 2\sum_{j=1}^q a_j^2 \max_{x \in \mathcal{K}} \sum_{j=1}^q x_j^2 + 2\sum_{j=1}^{rN_n} b_j^2 \max_{t \in [0,1]} \sum_{j=1}^{rN_n} \phi_j^2(t)$$
$$\le 2(M^2 + (2r - 1)^2)(q + rN_n)\Bigl\{\sum_{j=1}^q a_j^2 + \sum_{j=1}^{rN_n} b_j^2\Bigr\}.$$

Next, we show that there exists a $K_1 > 1$ such that

(A.8)  $\sum_{j=1}^q a_j^2 + \sum_{j=1}^{rN_n} b_j^2 \le K_1\|g\|_\lambda^2$.

Recall that $\theta_j(t) = E(X_j | T = t)$, $j = 1, \ldots, q$. Since $E\{(X_j - \theta_j(T))\phi_k(T)\} = 0$ for all $j$ and $k$, we obtain

(A.9)  $\|g\|_\mu^2 = E\Bigl(\sum_{j=1}^q a_j(X_j - \theta_j(T))\Bigr)^2 + E\Bigl(\sum_{j=1}^q a_j\theta_j(T) + \sum_{j=1}^{rN_n} b_j\phi_j(T)\Bigr)^2$.

Recall that $\Sigma = (\sigma_{ij})_{q \times q}$, with $\sigma_{ij} = \mathrm{Cov}(X_i, X_j) - \mathrm{Cov}(\theta_i(T), \theta_j(T))$, is positive definite and so its smallest eigenvalue $\lambda_{\min} > 0$, by Assumption 2.3. Also, recall that, under Assumption 2.1, $\mu$ has on its support a density with respect to $\lambda$ that is bounded below by $h_0 > 0$ and above by $h_1 < \infty$. Then, from (A.9) and under Assumption 2.1, with $a = (a_1, \ldots, a_q)$, it follows that

$$h_1\|g\|_\lambda^2 \ge \|g\|_\mu^2 \ge E\Bigl\{\sum_{j=1}^q a_j(X_j - \theta_j(T))\Bigr\}^2 = a'\Sigma a \ge \lambda_{\min}\sum_{j=1}^q a_j^2.$$

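Let $\lambda_{\mathcal{K}}$ denote the restriction of $\lambda$ to the compact set $\mathcal{K}$. Then

$$\sum_{j=1}^{rN_n} b_j^2 = \int_{[0,1]} \Bigl(\sum_{j=1}^{rN_n} b_j\phi_j(t)\Bigr)^2 d\lambda_T(t) \quad [\text{since } \{\phi_j\}_{j=1}^{rN_n} \text{ is orthonormal in } L_2(\lambda_T, [0, 1])]$$
$$= \frac{1}{\lambda_{\mathcal{K}}(\mathcal{K})} \int_{\mathcal{K} \times [0,1]} \Bigl(\sum_{j=1}^{rN_n} b_j\phi_j(t)\Bigr)^2 d\lambda(x, t) \quad [\text{multiplying and dividing by } \lambda_{\mathcal{K}}(\mathcal{K})]$$
$$\le \frac{2}{\lambda_{\mathcal{K}}(\mathcal{K})}\bigl\{\|g\|_\lambda^2 + h_0^{-1}E(a'X)^2\bigr\} \quad (\text{adding and subtracting } a'x \text{ and by Assumption 2.1})$$
$$\le \frac{2}{\lambda_{\mathcal{K}}(\mathcal{K})}\bigl(1 + h_0^{-1}h_1M^2\lambda_{\min}^{-1}\bigr)\|g\|_\lambda^2 \quad [\text{by Assumption 2.1 and the preceding display}].$$

From the last two displays above we conclude that (A.8) holds, with $K_1 = \frac{2}{\lambda_{\mathcal{K}}(\mathcal{K})}(1 + h_0^{-1}h_1M^2\lambda_{\min}^{-1}) + h_1\lambda_{\min}^{-1}$. This, together with (A.7), concludes the proof of this proposition. □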

Proof of Theorem 2.1. Let $s_{I,K_n}$ be the $L_2(\mu)$ projection of $s$ onto $S_{I,K_n}$. The previous proposition and Assumptions 2.1-2.4 ensure that Theorem 2.1 of Baraud (2002) can be invoked. Then

E\\s - s\\l < C2 inf {\\s- si^kAI + P«^(^' ^") + ^"}- 

Recall now that $s(x, t) = \beta'x + f(t)$. For any $K_n \in \mathcal{K}_n$, let $s_n(x, t) = \beta'x + f_{K_n}(t) \in S_{I_0,K_n}$, where we recall that $f_{K_n}$ is the $L_2(\mu_T)$ projection of $f$ onto $S_{K_n}$. Then we have

E\\s - sf < Ca inf {\\s - s„||^ + pen(/o,/s:„) + t?„} 

<C2inf{||/-/,,J|J^+pen(/o,/^„)+M- n 

Proof of Corollary 2.1. As an immediate consequence of Theorem 2.1 and the definition of $r_n$ (2.8), we have $\|s - \tilde s\|_\mu^2 = O_P(r_n)$. Let now $(X^*, T^*) \sim \mu$, with $(X^*, T^*)$ independent of $(X_1, T_1), \ldots, (X_n, T_n)$. Write $E^*$ for integration with respect to $(X^*, T^*)$ only. Notice that $E^*((X^* - E^*(X^*|T^*))m(T^*)) = 0$ for all bounded measurable functions $m$. Then

$$\|s - \tilde s\|_\mu^2 = E^*(s - \tilde s)^2(X^*, T^*)$$
$$= E^*\{(f - \tilde f)(T^*) + (\beta - \tilde\beta)'E^*(X^*|T^*)\}^2 + E^*\{(\tilde\beta - \beta)'(X^* - E^*(X^*|T^*))\}^2.$$

Since $\|s - \tilde s\|_\mu^2 = O_P(r_n)$, then

(A.10)  $E^*[(\tilde\beta - \beta)'(X^* - E(X^*|T^*))]^2 = O_P(r_n)$.

Since $(X^*, T^*)$ is independent of $\tilde\beta$, by construction we also have

(A.11)  $E^*[(\tilde\beta - \beta)'(X^* - E(X^*|T^*))]^2 = (\tilde\beta - \beta)'\Sigma(\tilde\beta - \beta) \ge \lambda_{\min}|\tilde\beta - \beta|_2^2$.

Thus, from (A.10) and (A.11), and since $\lambda_{\min} > 0$, we have that for any $\beta \in \mathcal{K}$, $|\tilde\beta - \beta|_2 = O_P(r_n^{1/2})$.

If we let $h(x) = \beta'x$, then we also have $\|\tilde h - h\|_{\mu_X} = O_P(r_n^{1/2})$. Thus

$$\|\tilde f - f\|_{\mu_T} = \|(\tilde f - f) + (\tilde h - h) - (\tilde h - h)\|_\mu \le \|\tilde s - s\|_\mu + \|\tilde h - h\|_{\mu_X} = O_P(r_n^{1/2})$$

for any $f \in \mathcal{D}_{\alpha,2}(L)$.

For the empirical norm counterpart of this result, recall that we defined 

r„ = {||s||A<2exp(log2n)}. 

From Baraud [(2002), proof of Theorem 1.1, page 19] we find for some constants $c_1, c_2 > 0$ that

(A.12) P(r^) < ci{exp(-21og2n) + n2exp(-C2log^n)} ^0. 

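Thus, it is enough to study the convergence rate of $\|f - \hat f\|_n$ on $\Gamma_n$. Notice that on $\Gamma_n$ we have $\tilde s = \hat s$, $\tilde\beta = \hat\beta$ and $\tilde f = \hat f$. Thus

$$\|f - \hat f\|_n^2 1_{\Gamma_n} \le 2\|\tilde s - s\|_n^2 1_{\Gamma_n} + \frac{2}{n}\sum_{i=1}^n \{(\tilde\beta - \beta)'X_i\}^2 1_{\Gamma_n}$$
$$\le 2\|\tilde s - s\|_n^2 + \frac{2}{n}\sum_{i=1}^n \{(\tilde\beta - \beta)'X_i\}^2 = O_P(r_n),$$

because $(\tilde\beta - \beta)'\,n^{-1}\sum_{i=1}^n X_iX_i'\,(\tilde\beta - \beta) = O_P(r_n)$, since $|\tilde\beta - \beta|_2^2 = O_P(r_n)$, as above, and $n^{-1}\sum_{i=1}^n X_iX_i'$ converges in probability. Also $\|\tilde s - s\|_n^2 = O_P(r_n)$, by Corollary 3.2, page 474, in Baraud (2000), the conditions of which are verified by our assumptions and Proposition A.2. This completes the proof of this corollary. □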

Proof of Corollary 3.1. We now evaluate (2.8) for each penalty choice.

For the penalty term (3.4), the infimum is achieved for $K_n^* \asymp (n/\log n)^{1/(2\alpha+1)}$ and hence $r_n = O((\log n/n)^{2\alpha/(2\alpha+1)})$.

For (3.5), $r_n = O(\log n/n^{2\alpha/(2\alpha+1)})$ is obtained by replacing $K_n$ by $K_{n,\alpha} \asymp n^{1/(2\alpha+1)}$ in (2.8).

For (3.6), $r_n = O(\log n/n^{(2a+1)/(2a+2)})$ is obtained by replacing $K_n$ by $K_{n,a} \asymp n^{1/(2a+2)}$ in (2.8). □
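
The three rates can be compared numerically. The sketch below assumes that the quantity being balanced is, up to constants, a squared bias of order $K^{-2\alpha}$ plus a penalty of order $K \log n / n$; this reading of (2.8) and the function names `rates` and `bias_plus_penalty` are assumptions made for illustration.

```python
import math

def rates(n, alpha, a):
    """Orders of r_n for the penalties (3.4), (3.5) and (3.6), constants ignored."""
    return {
        "adaptive (3.4)": (math.log(n) / n) ** (2 * alpha / (2 * alpha + 1)),
        "known alpha (3.5)": math.log(n) / n ** (2 * alpha / (2 * alpha + 1)),
        "unknown alpha (3.6)": math.log(n) / n ** ((2 * a + 1) / (2 * a + 2)),
    }

def bias_plus_penalty(n, alpha, K):
    """Assumed form of the trade-off: squared bias K^(-2 alpha) plus K log(n) / n."""
    return K ** (-2 * alpha) + K * math.log(n) / n

n, alpha, a = 10_000, 2.0, 0.1
r = rates(n, alpha, a)

# Plugging K_{n,alpha} = n^{1/(2 alpha + 1)} into the trade-off gives the (3.5) rate
# multiplied by (1 + 1/log n).
K_alpha = n ** (1 / (2 * alpha + 1))
value = bias_plus_penalty(n, alpha, K_alpha)
ratio = value / r["known alpha (3.5)"]
```

For $n > e$ the three orders are increasing from (3.4) to (3.6) whenever $\alpha \ge 1/2 + a$, which reflects the price paid for not knowing $\alpha$ while still selecting covariates consistently.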

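A.3. The asymptotic normality of $\hat\beta_{q_0}$.

Lemma A.4. Under Assumptions 2.1-2.5, if $f \in \mathcal{D}_{\alpha,2}(L)$, we have

(A.13)  $\boldsymbol{\varepsilon}'(\mathrm{Id} - P_{\hat K_n})\mathbf{f}/\sqrt{n} \xrightarrow{P} 0$,

(A.14)  $\boldsymbol{\theta}'(\mathrm{Id} - P_{\hat K_n})\mathbf{W}/\sqrt{n} \xrightarrow{P} 0$,

(A.15)  $\boldsymbol{\theta}'(\mathrm{Id} - P_{\hat K_n})\mathbf{f}/\sqrt{n} \xrightarrow{P} 0$,

(A.16)  $\boldsymbol{\varepsilon}'P_{\hat K_n}\mathbf{W}/\sqrt{n} \xrightarrow{P} 0$ as $n \to \infty$.

The above results hold uniformly over $\alpha \in (1/2, r)$ and $\gamma \in (\delta/4, r)$, $\delta > 3$.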

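Proof of Theorem 4.1. Let $\mathbf{f} = (f(T_1), \ldots, f(T_n))'$ and $\mathbf{W} = (W_1, \ldots, W_n)'$. Recall Notation A.1, specialized now to $I = I_0$. By Lemma A.1, except for an event with probability tending to zero,

(A.17)
$$\sqrt{n}(\hat\beta_{q_0} - \beta_{q_0}) = \sqrt{n}(\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{X}_{I_0})^{-1}\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{f}$$
$$+ \sqrt{n}(\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{X}_{I_0})^{-1}\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{W}.$$

Since $\mathbf{X}_{I_0}' = \boldsymbol{\theta}' + \boldsymbol{\varepsilon}'$, then

(A.18)
$$\sqrt{n}(\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{X}_{I_0})^{-1}\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{f}$$
$$= n(\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{X}_{I_0})^{-1} \times (\boldsymbol{\theta}'(\mathrm{Id} - P_{\hat K_n})\mathbf{f}/\sqrt{n} + \boldsymbol{\varepsilon}'(\mathrm{Id} - P_{\hat K_n})\mathbf{f}/\sqrt{n}),$$

and

(A.19)
$$\sqrt{n}(\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{X}_{I_0})^{-1}\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{W}$$
$$= n(\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{X}_{I_0})^{-1} \times (\boldsymbol{\varepsilon}'\mathbf{W}/\sqrt{n} - \boldsymbol{\varepsilon}'P_{\hat K_n}\mathbf{W}/\sqrt{n} + \boldsymbol{\theta}'(\mathrm{Id} - P_{\hat K_n})\mathbf{W}/\sqrt{n}).$$

By Proposition A.1 applied to $I = I_0$, $n(\mathbf{X}_{I_0}'(\mathrm{Id} - P_{\hat K_n})\mathbf{X}_{I_0})^{-1} \xrightarrow{P} \Sigma_{q_0}^{-1}$ as $n \to \infty$. Also, notice that $\boldsymbol{\varepsilon}'\mathbf{W} = \sum_{i=1}^n D_i$, where $D_i = W_i\varepsilon_i$ are i.i.d. vectors with $ED_i = 0$ and $E(D_iD_i') = \sigma^2\Sigma_{q_0}$. Then it follows from the multivariate central limit theorem that

(A.20)  $\boldsymbol{\varepsilon}'\mathbf{W}/\sqrt{n} = \sum_{i=1}^n D_i/\sqrt{n} \xrightarrow{d} N_{q_0}(0, \sigma^2\Sigma_{q_0})$.

Then, by (A.17)-(A.20), Lemma A.4 and Slutsky's lemma,

$$\sqrt{n}(\hat\beta_{q_0} - \beta_{q_0}) \xrightarrow{d} N_{q_0}(0, \sigma^2\Sigma_{q_0}^{-1}).$$ □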
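
The estimator in (A.17) and the limit in Proposition A.1 can be illustrated on simulated data. The following sketch makes several assumptions that are not from the paper: a histogram basis with a fixed number of bins `K` stands in for the approximating space, the data-generating process is invented, and `partialling_out` is a hypothetical helper name. It computes $(\mathbf{X}'(\mathrm{Id} - P)\mathbf{X})^{-1}\mathbf{X}'(\mathrm{Id} - P)\mathbf{Y}$ and the matrix $\mathbf{X}'(\mathrm{Id} - P)\mathbf{X}/n$.

```python
import numpy as np

def partialling_out(X, t, y, K):
    """Return beta_hat = (X'(Id-P)X)^{-1} X'(Id-P)y and X'(Id-P)X / n,
    where P projects onto indicators of K equal-width bins of t in [0, 1]."""
    n = len(y)
    bins = np.minimum((t * K).astype(int), K - 1)
    Z = np.eye(K)[bins]
    P = Z @ np.linalg.pinv(Z)                 # orthogonal projection onto the span of Z
    M = np.eye(n) - P
    A = X.T @ M @ X
    beta_hat = np.linalg.solve(A, X.T @ M @ y)
    return beta_hat, A / n

# Simulated partially linear model (assumed design):
# X_j = theta(T) + eps_j with eps_j ~ Uniform(-1, 1), so Sigma = Var(eps) = I / 3.
rng = np.random.default_rng(1)
n, K = 1000, 10
t = rng.uniform(size=n)
X = np.sin(2 * np.pi * t)[:, None] + rng.uniform(-1, 1, size=(n, 2))
beta = np.array([1.0, -0.5])
y = X @ beta + np.cos(2 * np.pi * t) + rng.normal(size=n)

beta_hat, Sigma_hat = partialling_out(X, t, y, K)
Sigma = np.eye(2) / 3
```

Forming the $n \times n$ projection explicitly mirrors the formulas in the proof; for larger $n$ one would instead subtract bin means from `X` and `y`, which yields the same residuals for this basis.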

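Proof of Lemma A.4. We begin by proving (A.13). Recall the definition of $\varepsilon_{ij}$, $j = 1, \ldots, q$, $i = 1, \ldots, n$, introduced in Appendix A.1. Then, by Assumption 2.1, for some constant $C_\varepsilon > 0$ we have $|\varepsilon_{ij}| \le C_\varepsilon$ for all $j$ and $i$, except for a set of measure zero. Recall that $f_{K_n}$ is the $L_2(\mu_T)$ projection of $f$ onto $S_{K_n}$ and so $\mathbf{f}_{K_n} \in L_{K_n} = \{(g(T_1), \ldots, g(T_n))'; g \in S_{K_n}\}$. Then, for every $c > 0$, by Markov's inequality and (2.7),

$$P(|\boldsymbol{\varepsilon}_j'(\mathrm{Id} - P_{\hat K_n})\mathbf{f}/\sqrt{n}| > c) \le \sum_{K_n \in \mathcal{K}_n} P(|\boldsymbol{\varepsilon}_j'(\mathrm{Id} - P_{K_n})\mathbf{f}/\sqrt{n}| > c)$$
$$\le \frac{C_\varepsilon^2}{c^2} \sum_{K_n \in \mathcal{K}_n} E\|\mathbf{f} - P_{K_n}\mathbf{f}\|_n^2 \le \frac{C_\varepsilon^2}{c^2} \sum_{K_n \in \mathcal{K}_n} E\|\mathbf{f} - \mathbf{f}_{K_n}\|_n^2$$
$$= \frac{C_\varepsilon^2}{c^2} \sum_{K_n \in \mathcal{K}_n} \|f - f_{K_n}\|_{\mu_T}^2 \le \frac{C_\varepsilon^2}{c^2}\, h_1 C(\alpha, L) \sum_{K_n \in \mathcal{K}_n} \frac{1}{K_n^{2\alpha}} \to 0.$$

The second result of this lemma is very similar to (A.13). Now $\mathbf{W}$ will play the role of $\boldsymbol{\varepsilon}_j$ and $\boldsymbol{\theta}_j$ the role of $\mathbf{f}$. To emphasize the similarity, we shall in fact prove that $\mathbf{W}'(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j/\sqrt{n} \xrightarrow{P} 0$ for each $j \in I_q$. As above, applying Markov's inequality and using now the independence of $\mathbf{W}$ and $(X, T)$, we obtain

$$P(|\mathbf{W}'(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j/\sqrt{n}| > c) \le \sum_{K_n \in \mathcal{K}_n} P(|\mathbf{W}'(\mathrm{Id} - P_{K_n})\boldsymbol{\theta}_j/\sqrt{n}| > c) \le \frac{\sigma^2}{c^2}\, h_1 C(\gamma, A) \sum_{K_n \in \mathcal{K}_n} \frac{1}{K_n^{2\gamma}} \to 0$$

for any $c > 0$ and $\gamma > \delta/4$, by (2.7) applied to $\theta_j$ and recalling Assumption 2.5.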

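For the third result of this lemma notice that, by the same argument as in (A.6), with $\boldsymbol{\theta}_k$ replaced now by $\mathbf{f}$, we obtain

$$n^{-1/2}|\boldsymbol{\theta}_j'(\mathrm{Id} - P_{\hat K_n})\mathbf{f}| \le \sqrt{n}\,\|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j\|_n \|\mathbf{f} - P_{\hat K_n}\mathbf{f}\|_n \le \sqrt{n}\,\|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j\|_n \|\mathbf{f} - \tilde{\mathbf{f}}\|_n,$$

where we recall now the definition of $\tilde f$ from Section 2.2 and that $\tilde f = \hat f 1_{\Gamma_n}$. Then

$$P(\sqrt{n}\,\|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j\|_n \|\mathbf{f} - \tilde{\mathbf{f}}\|_n > c)$$
$$= P(\sqrt{n}\,\|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j\|_n \|\mathbf{f} - \tilde{\mathbf{f}}\|_n > c, \Gamma_n) + P(\sqrt{n}\,\|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j\|_n \|\mathbf{f} - \tilde{\mathbf{f}}\|_n > c, \Gamma_n^c)$$
$$\le P(\sqrt{n}\,\|(\mathrm{Id} - P_{\hat K_n})\boldsymbol{\theta}_j\|_n \|\mathbf{f} - \hat{\mathbf{f}}\|_n > c) + P(\Gamma_n^c) \to 0,$$

by Corollary 3.1 and Lemma A.3, and since $P(\Gamma_n^c) \to 0$ by Theorem 2.1 in Baraud (2002).

Finally, notice that if in (A.5) we replace $\boldsymbol{\varepsilon}_k$ by $\mathbf{W}$, we obtain, up to constants, that $P(|n^{-1/2}\boldsymbol{\varepsilon}_j'P_{\hat K_n}\mathbf{W}| > c) \le E(\hat K_n)/(c\sqrt{n}) \to 0$, since $E(\hat K_n) \le N_n \asymp (n/\log n)^{1/2}$ by (2.1). This completes the proof of this lemma. □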

Proof of Corollary 4.1. The proof of this corollary follows immediately by replacing $\hat K_n$ throughout the proofs of Theorem 4.1 and Lemma A.4 by $K_{n,\alpha}$ and $K_{n,a}$, respectively. □

A.4. 

Lemma A.5. Let $V$ be a symmetric, semipositive definite $q \times q$ matrix. Let $V^0$ be a symmetric, positive definite $q_0 \times q_0$ matrix. Let $c \in \mathbb{R}^q$ and $c_0 \in \mathbb{R}^{q_0}$, $c_0 = (c_1, \ldots, c_{q_0})$. Then the only $V$ which satisfies

(A.21)  $c_0'V^0c_0 = c'Vc$ for any $c \in \mathbb{R}^q$

is

$$V = \begin{pmatrix} V^0 & O_{q_0 \times (q - q_0)} \\ O_{(q - q_0) \times q_0} & O_{(q - q_0) \times (q - q_0)} \end{pmatrix},$$

where the $O$ matrices have all elements equal to zero.

Proof. Since we seek $V$ symmetric and satisfying (A.21), then $V$ has, in general, the structure below:

$$V = \begin{pmatrix} V^0 & A_{q_0 \times (q - q_0)} \\ A'_{(q - q_0) \times q_0} & B_{(q - q_0) \times (q - q_0)} \end{pmatrix}$$

with arbitrary $A$ and $B$. Since we require (A.21) to hold for any $c \in \mathbb{R}^q$, we show that $A$ and $B$ are zero matrices of appropriate dimensions. First we note that for any symmetric, semipositive definite matrix $C$, $x'Cx = 0$ for any $x$ implies $C = O$, with $O$ being the zero matrix. Now let $c \in \mathbb{R}^q$ be arbitrary and write it as $c = (c_0, c_1)$, for $c_1 \in \mathbb{R}^{q - q_0}$. Then

(A.22)  $c'Vc = c_0'V^0c_0 + c_0'Ac_1 + c_1'A'c_0 + c_1'Bc_1$.

We then find $A$ and $B$ such that (A.21) holds, or equivalently, by (A.22), such that

(A.23)  $2c_0'Ac_1 + c_1'Bc_1 = 0$ for any $c \in \mathbb{R}^q$.

Since we want the above display to hold for any $c \in \mathbb{R}^q$, then it should, in particular, hold for $c = (0, c_1)$. Thus, from (A.23), $B$ must satisfy

(A.24)  $c_1'Bc_1 = 0$ for any $c_1 \in \mathbb{R}^{q - q_0}$.

Now note that since $V$ is semipositive definite, so is $B$. This, in connection with (A.24), implies then that $B = O_{(q - q_0) \times (q - q_0)}$. Then we have to find $A$ that satisfies $c_0'Ac_1 = 0$, for any $c_0 \in \mathbb{R}^{q_0}$ and $c_1 \in \mathbb{R}^{q - q_0}$. Thus, the equation must be, in particular, satisfied for the canonical bases in $\mathbb{R}^{q_0}$ and $\mathbb{R}^{q - q_0}$, respectively, which implies that $A = O_{q_0 \times (q - q_0)}$, which completes the proof of this lemma. □

Lemma A.6 (Rosenthal's inequality). Let $U_1, \ldots, U_n$ be independent centered random variables with values in $\mathbb{R}$. For any $p \ge 2$, we have

$$E\Bigl|\sum_{i=1}^n U_i\Bigr|^p \le C(p)\Bigl(\sum_{i=1}^n E|U_i|^p + \Bigl(\sum_{i=1}^n EU_i^2\Bigr)^{p/2}\Bigr),$$

where $C(p) > 0$ depends only on $p$.

For a proof of this inequality see, for example, Petrov (1995).